Closed · @iamelijahko closed this 4 weeks ago
> …leveraging GitHub branches for comprehensive version control. GitHub improves versioning efficiency by tracking and storing only the changes made, rather than creating complete copies of files, thus saving significant storage space.

Can you clarify this? How is it possible to use GitHub to version data?
Just to keep this on record as a potential solution to the challenges of using timestamps: maybe we could use ULID as the version number format?
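To illustrate the idea: a ULID encodes a 48-bit millisecond timestamp followed by 80 random bits in Crockford base32, so version names sort lexicographically by creation time while still being collision-resistant. A minimal stdlib-only sketch (in practice a library such as `python-ulid` would be used; the function name here is illustrative):

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U), as used by the ULID spec
_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def new_ulid() -> str:
    """Return a 26-character ULID: 48-bit ms timestamp + 80 random bits."""
    value = (int(time.time() * 1000) << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):          # 26 chars x 5 bits = 130 bits (top 2 bits are zero)
        chars.append(_ALPHABET[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the most significant bits, a plain string sort of version directories is also a chronological sort, which is exactly the property timestamps are used for today, without the readability problems of colons and dots in paths.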
Thanks a lot for the summary, @iamelijahko! Could you update it with
I would like to add a bit more color to the synthesis @iamelijahko has already provided.
`versioned: true` datasets on open source projects: https://github.com/kedro-org/kedro/network/dependents shows 2,439 repositories, and this query shows 154 files. That's an upper bound of roughly 6 % of open repositories using versioned datasets, even before discarding those that are mostly a copy-paste of the spaceflights tutorial.
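For context, `versioned: true` is set per dataset entry in `catalog.yml`; with it enabled, each save lands in a timestamped subdirectory under the declared filepath. A minimal sketch (the dataset name and path are illustrative):

```yaml
# conf/base/catalog.yml
model_input:
  type: pandas.CSVDataset
  filepath: data/05_model_input/model_input.csv
  versioned: true   # saves go to .../model_input.csv/<timestamp>/model_input.csv
```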
`--load-versions` in our telemetry. Out of 3,537,184 total `kedro run` commands, only 1,644 included `--load-versions` (~0.05 %).
https://linen-slack.kedro.org/?threads%5Bquery%5D=%22versioned%3A%20true%22 shows 35 results. It's difficult to assess how many distinct questions (~threads) there are, but for reference, "dataset" yields 877 results, "plugin" yields 270, and "node" yields 731. Searching for "*" gives 3,773 results, so this is roughly 1 % of the messages.
For example, https://github.com/kedro-org/kedro/issues/4028#issuecomment-2315318257 states
> just for your UX research transparency, we now completely moved away from versioning and instead have a `RUN_ID` env variable that we pick up in `globals.yaml` and prefix all pipeline paths with that. we found this approach (all data of a version bundled under one path) to be preferable.
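The pattern that comment describes could be sketched in Kedro configuration roughly as follows. This is illustrative only: the file names and entries are assumptions, and whether the `oc.env` resolver is usable in `globals.yml` depends on your Kedro version and `OmegaConfigLoader` setup.

```yaml
# conf/base/globals.yml  (sketch: RUN_ID comes from the environment, "local" as fallback)
run_id: "${oc.env:RUN_ID,local}"

# conf/base/catalog.yml  (sketch: every output path is prefixed with the run id)
model_input:
  type: pandas.ParquetDataset
  filepath: "data/${globals:run_id}/model_input.parquet"
```

The effect is that all artefacts of one run are bundled under a single `data/<RUN_ID>/` directory, which is the "all data of a version under one path" property the commenter prefers over per-dataset timestamp folders.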
Read @iamelijahko's thorough market analysis in https://github.com/kedro-org/kedro/wiki/Market-research-on-versioning-tools
For example, these 5 issues with the Polars datasets:
https://github.com/kedro-org/kedro-plugins/issues/789 https://github.com/kedro-org/kedro-plugins/issues/702 https://github.com/kedro-org/kedro-plugins/issues/625 https://github.com/kedro-org/kedro-plugins/issues/590 https://github.com/kedro-org/kedro-plugins/issues/444
are all blocked because of how we use fsspec for our versioning.
(Disclaimer: I opened 3 of them, but one comes directly from a user question and another has supporting evidence that other users are affected.)
Given the above, the comparatively large number of recommendations coming out of this research, and the need to allocate our limited resources efficiently, it becomes crucial not only to prioritise which ones to tackle, but more importantly to give a coherent vision of how we want the versioning workflow in Kedro to be.
As such, I would like us to pick between these two strategies:
Under one of these two optics, I believe it will be easier to interpret the recommendations at the top of this thread. It might also inform how we approach the last big part of the "Kedro I/O redesign", custom dataset creation: https://github.com/kedro-org/kedro/issues/1936
We've had extensive discussions about this in the past weeks. Here is a summary of where we're at and what the proposed next steps are.
Draft a path towards the deprecation of `AbstractVersionedDataset` and its replacement by something leaner and better.
See summary at https://github.com/kedro-org/kedro/issues/4129#issuecomment-2330282811
Some extra points on top of what I already wrote:
- `Journal`, which was deprecated and removed in Kedro 0.18.0: https://github.com/kedro-org/kedro/issues/757#issuecomment-842404512
- The replacement for `Journal` was Experiment Tracking in Kedro-Viz (more on that below)

Several concerns were raised:

- Whether `AbstractVersionedDataset` is the best we can come up with.
- … `AbstractVersionedDataset`.
- `AbstractVersionedDataset` is ingrained in almost every layer of Kedro, and as such, if we ever decide to actually get rid of it, it will be a considerable amount of work.

I love the way you've written out what we should do next! I completely agree with the idea of proposing a workstream to track and better understand who exactly is using Kedro-Viz. This could offer valuable insights into how adoption plays out across different user segments, particularly when narrowing down experiment tracking.
That being said, I do have some concerns about how we're identifying a Kedro-Viz user, especially on Heap. I was reviewing some download data to get a better sense of usage, and there seems to be a significant discrepancy that might be worth investigating further. For context, Kedro-Viz has around 4 million downloads, whereas the Kedro framework itself is sitting at roughly 17 million downloads. That puts Kedro-Viz usage at about 23-24% relative to the core framework, which feels more aligned with what we’d expect.
However, with the changes we've made to telemetry, Kedro-Viz is now being reported as used by only 0.7 % of total Kedro users, which just doesn't add up. I know @rashidakanchwala has looked into this previously, but there's definitely something odd going on here that still needs to be clarified. It could be a propagation issue affecting data from Heap all the way to our Snowflake instance.
Also, adding to the complexity, I find it less likely that Kedro-Viz is being heavily used in production environments, while Kedro’s usage numbers might be inflated by CI/CD pipelines. So, while the download figures likely reflect more accurate user numbers, the current telemetry data seems to be painting an unclear picture.
Moved the content to https://github.com/kedro-org/kedro/wiki/Versioning-research
What Should We Do Next?
• Integration with Leading Tools: Consider integrating with leading tools like Delta Lake, Apache Hudi, Apache Iceberg for data management, Git for code versioning, and MLflow and DVC for model versioning. Users have reported that Kedro's dataset versioning faces compatibility issues on platforms such as Databricks and Palantir Foundry, reducing its versatility and leading to redundancy with more mature platforms. Refer to market research for insights on how other tools support versioning in data, code, and models.
• Alignment with Data Lakehouse Concepts: The industry's enthusiasm for the data lakehouse concept, which includes features like versioning and time travel, doesn't fully align with Kedro's current design, creating challenges for integration and complementarity.
• Interaction Between Code and Data Versions: Consider how code and data versions interact, potentially creating non-linear branches. This would enable better tracing and auditing by identifying which code version produced which data version, and allow branching from specific points in time, thus addressing the multidimensional aspects of versioning.
• Refer to this Miro board for versioning of the various artefacts.
• E.g. PMPx team implemented a GitHub-based versioning in Kedro to track entire pipelines, leveraging GitHub branches for comprehensive version control. GitHub improves versioning efficiency by tracking and storing only the changes made, rather than creating complete copies of files, thus saving significant storage space.
• Single Number Version Tracking: Users need a single version number that maps to the corresponding versions of the model, data, and code. This approach simplifies tracking and ensures compatibility, eliminating the complexity of managing multiple version numbers.
• Customized Version Names: Consider allowing users to set up customized version names, such as incorporating specific parameters.
• Automatic Logging: Implement automatic logging of key parameters and metrics with each version to maintain a complete historical context.
• Detailed Metadata Logging: Include detailed metadata with each version, such as data size and key parameters, to provide a comprehensive record.
• Maintain Historical Files: Consider keeping all historical versions of files with attributes for easy lookup without needing additional functions.
• Refer to this Miro board for the user journey on Kedro versioning.
• Clear Documentation: Document what changes were made in each version, including any parameter adjustments or data modifications.
• Collaborative (Sharing) Versioning in Managed Analytics: Ensure multiple users can access versioned outcomes easily, avoid local machine conflicts, and utilize platforms like GitHub for effective collaborative versioning.
Priority matrix (Miro board)
Artefacts: What to Track?
Reproducing runs in Kedro is challenging because code, parameters, and data are not captured completely, hindering full reproducibility. More granular versioning across data types could improve this, despite Kedro's current limitations. Miro link: https://miro.com/app/board/uXjVK9U8mVo=/?moveToWidget=3458764597910279898&cot=14
User journey
Miro link: https://miro.com/app/board/uXjVK9U8mVo=/?moveToWidget=3458764596155374065&cot=14
Data
From the user interviews, data versioning involves tracking and managing different versions of datasets over time, allowing for consistent results even when code remains unchanged. It typically includes handling large tables and unstructured data by storing snapshots or slices at specific points, enabling historical analysis. While unstructured data is often versioned by copying versions, semi-structured data may require specialized algorithms, and large datasets demand careful management due to their complexity.
Pain points: Data
• A `--disable-versioning` flag in Kedro's CLI to prevent unnecessary version creation, tag important outputs, and simplify storage engine selection with compatible options like Apache Hudi or Delta.
• Support loading the latest version directly, e.g. `catalog.load("df", version="last")`, enhancing workflow efficiency and intuitive version management.