Description

Establish clear criteria and guidelines for determining when a contribution is considered a regular (non-experimental) contribution versus an experimental one. This will help contributors and maintainers understand the expectations and classification of datasets.

Context

517

My thinking around this is as follows:

Regular dataset contributions

Must be useful for substantial group of users, not a super niche dataset that’s only used for e.g. a specific type of bacterial data
Must have 100% test coverage
- Tests must run via CI/CD jobs
Must have doc strings that explain the use of the dataset
Should have working doc strings, e.g. passes doctest, unless complex cloud/DB setup required
Can be used with all versions Python 3.9+
Must work on both Windows and Linux
Uses “stable” dependencies (not expected to be discontinued anytime soon)
Must implement the AbstractDataset and at a minimum the following classes need to be implemented:
- _load
- _save
- _describe
Will be maintained by the Kedro team

Datasets that don’t meet the above criteria

SnowParkDataset
- Not 100% test coverage
- Only works with Python 3.8 (this is what’s stated in the doc strings) or 3.9 (pinned version in setup.py)?
databricks.ManagedTableDataset
- Tests don’t run as part of CI/CD
- Not 100% test coverage
HoloviewsWriter
- Not 100% test coverage
NetCDFDataset
- Not 100% test coverage: problems with mocking S3 operations.
TensorFlowModelDataset
- Not 100% test coverage
- Test run as part of CI/CD, but separate from full test suite, because they’re flaky.

Experimental dataset contributions

Are owned by their primary authors, and the core Kedro team is not responsible for their active maintenance or support with using the datasets
Do not need tests
Do not need to pass doctests
Should have doc strings that explain the use of the dataset
Must implement the AbstractDataset and at a minimum the following classes need to be implemented:
- _load
- _save
- _describe

Are owned by their primary authors, and the core Kedro team is not responsible for their active maintenance or support with using the datasets

Do not need tests

Do we need to run the dataset manually at least? And should the contribution include some guide on how to install the dependencies?

I'd add if there are similar datasets exists already, we should try to merge them instead of having 10 different contributions for the same thing.

When we started discussing about this I had a simpler workflow in mind. I'd like to give some holistic feedback on #581, #582, #583 (this issue) and propose an alternative, simpler way forward.

First, some remarks

Are owned by their primary authors, and the core Kedro team is not responsible for their active maintenance or support with using the datasets

(@merelcht)

I think this can open us up for lots of frustration. The experimental path was meant to lower the barrier of entry and lift some burden from us, but making someone commit to "own" the dataset seems to raise it. If we do so, they might as well keep their dataset closed source, or publish it in their own repo (which is fine, keep reading). And even if they commit to "owning" it, they can leave whenever they want, and this will happen. So going forward we'll need to keep track of which datasets are orphan (hence do a lot of #581 and #582).

I think this is a lot of toil.

From my understanding, the goal is to attract more contributions while not lowering the standard of the 1st party supported datasets.

(@noklam on https://github.com/kedro-org/kedro-plugins/issues/581#issuecomment-1991812045)

At the moment, contributing datasets is really hard, from a technical perspective. #165 was open for half a year, and in the end we had to skip some tests. #435 took ages to merge. #580 is having CI failures just when it was about to be merged.

We are looking at a systemic issue here which has nothing to do with the experimental process. Even if these authors wanted to make these datasets "official"/tier-1/regular, the process would have been quite painful.

The reality is that running the CI for the amount of datasets we have and the current design of datasets, which requires tons of mocking (https://github.com/kedro-org/kedro/issues/1936, https://github.com/kedro-org/kedro/issues/1778), is just hard. Installing all the dependencies of all datasets (essentially needed to run the tests and also to build the docs) is getting more and more difficult.

A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict #597 (comment)

(myself in https://github.com/kedro-org/kedro-plugins/issues/535#issuecomment-1981301078)

The problem with "visibility"

At the same time, we seem to be stuck in very long discussions, making it seem like the only valid way to accept a dataset is merging it to kedro-datasets. Which is also misleading, because it's been a well known issue that GitHub monorepos get less visibility #401.

All in all, I think this set of requirements (have a reasonable CI to maintain, have a lightweight governance on this repo, and have everything on kedro-datasets) is impossible to satisfy.

Alternative proposal

Everything that the Kedro TSC is willing to maintain stays in kedro_datasets. These have CI, they're fully documented in https://docs.kedro.org/projects/kedro-datasets/, and in general have the highest quality standards.
Everything else me move to kedro_datasets.experimental. There's no CI for those, and they aren't fully documented. We only keep a list of available datasets in the docs. If anything, we have some doctests or nightly integration checks but if they fail they don't block the CI of the rest of the package. "Experimental" datasets are way for us to say "this is some code, you can depend on it or you can copy paste it, we signal that Kedro theoretically works with that dataset, we inspire you to build your project, don't expect maintenance here".
Going forward, all the community (i.e. non-TSC) contributions are experimental and follow (2). (1), (2), and (3) make #583.
Experimental datasets graduate when someone from the TSC commits to maintain them. This can happen because an existing member takes an experimental dataset, or because we are confident that someone will do a good job ("contributors that have shown regular activity on the project over the prior months and want to become maintainers") and we give them a seat at the table. This makes #581.
When said maintainer leaves the TSC, we decide whether we want to share the load, otherwise we deprecate and demote it in an orderly fashion. This makes #582.

By doing this, we will likely reduce the surface area of kedro-datasets, make it more clear that it's maintained by the TSC, and tie the governance to the existing process we have, without adding more.

Beyond `kedro-datasets`

I insist: we are sending the wrong signal by saying that all datasets need to be in kedro-datasets. So here's what we can do about it:

First, lead with the example: we take TSC datasets to separate repos when appropriate. Case in point: @deepyaman's #560. This does not mean that we have 1 repo per dataset - it means that some datasets, because of marketing or product or special reasons, get their own workspace. I don't think we have to make it a general rule. Ideally it should live in gh:kedro-org, but it's not mandatory: after all, kedro-mlflow and kedro-boot live under @Galileo-Galilei and @takikadiri personal accounts at the moment[^1].
We encourage the community to publish their own datasets. How? By giving them visibility (awesome-kedro, a mention in our datasets docs, we figure it out) and an easy template they can copy. It can either be a GitHub template or a copier template - but regardless, something that helps users getting started with datasets with ease without having to (a) maintain them on our kedro-datasets or (b) put them in the "experimental".

[^1]: I'm of the opinion that we could absolutely move kedro-mlflow to gh:kedro-org if @Galileo-Galilei was happy about it. But that's another story, let's not derail the conversation with my unwarranted digressions.

Must have doc strings that explain the use of the dataset

Should have working doc strings, e.g. passes doctest, unless complex cloud/DB setup required

Must implement the AbstractDataset and at a minimum the following classes need to be implemented:

_load

_save

_describe Can we separate the basic requirement for any contribution, so it's not duplicated/confusing?

Must work on both Windows and Linux I think it's fine to have some things that may not work on Windows, if they provide sufficient value, and effort is made to support it on Windows (without success). Windows support isn't really a requirement for use in production settings.

Will be maintained by the Kedro team Agree with most of the points by @astrojuanlu on this. I think it makes sense to say we do minimal maintenance on these, and look to the community (could be original author) to help with enhancement requests. General improvements would be a low priority for the core team, unless somebody is motivated/personally interested. If CI breaks due to a hard-to-resolve issue, we reserve the right to just skip it or xfail some things, whereas for a core dataset we try to fix with P0/1 priority.

Additional criteria I would like to provide for core datasets vs. experimental:

Core dataset requirements must be resolvable with those of all the other core datasets, together.
- This relaxes the current constraint on every single dataset being resolvable, and allows introducing experimental datasets without worrying as much about restrictive underlying packages.
- The majority of people contributing to Kedro-Datasets will either set up an environment just for the core datasets, or not worry about building an environment with all the dependencies, because they are working on a single experimental dataset. This may require a second set of instructions (contributing to a core dataset vs. contributing to experimental), but this may have been required, anyway.
A bit different than the "Will be maintained by the Kedro team", I think a core dataset design should be "blessed" by the Kedro team. What this means is, Kedro team understands how this backend works, and has opinions on how this should be best designed. For example, for Spark (or even MLFlow, or Ibis, should they become core), the team delves into understanding how these systems work. However, if somebody provides a Meteostat (random example) dataset, anything that looks reasonable can pass into experimental, without the team having to delve into how Meteostat works and what a good design could be.

Everything else me move to kedro_datasets.experimental. There's no CI for those, and they aren't fully documented. We only keep a list of available datasets in the docs. If anything, we have some doctests or nightly integration checks but if they fail they don't block the CI of the rest of the package. "Experimental" datasets are way for us to say "this is some code, you can depend on it or you can copy paste it, we signal that Kedro theoretically works with that dataset, we inspire you to build your project, don't expect maintenance here".

I would still like to see a higher standard than "random dataset implementation found on the internet." People will still use a dataset in Kedro-Datasets, if it's experimental, and I think we don't want to say, "I have no idea if this even runs, or how to use it." The nightly build process seems fine (to make CI lighter), but I think it could largely contain a similar test suite?

Going forward, all the community (i.e. non-TSC) contributions are experimental and follow (2). (1), (2), and (3) make Decide on definitions of regular and experimental contributions #583.

Sounds good.

Experimental datasets graduate when someone from the TSC commits to maintain them. This can happen because an existing member takes an experimental dataset, or because we are confident that someone will do a good job ("contributors that have shown regular activity on the project over the prior months and want to become maintainers") and we give them a seat at the table. This makes Determine graduation process for contributions #581.

I don't think anybody should become a maintainer because of work on a single dataset; however, it can be evidence combined with other things that they could be considered to become a maintainer.

When said maintainer leaves the TSC, we decide whether we want to share the load, otherwise we deprecate and demote it in an orderly fashion. This makes Determine demotion process of dataset contributions #582.

Will there be primary (TSC) ownership listed for core datasets, then? Or at least the more "niche" ones? Otherwise, how do you know who owned it to begin with. I think this is fine, just checking. We can easily do this with CODEOWNERS. :)

First, lead with the example: we take TSC datasets to separate repos when appropriate. Case in point: @deepyaman's https://github.com/kedro-org/kedro-plugins/pull/560. This does not mean that we have 1 repo per dataset - it means that some datasets, because of marketing or product or special reasons, get their own workspace. I don't think we have to make it a general rule. Ideally it should live in gh:kedro-org, but it's not mandatory: after all, kedro-mlflow and kedro-boot live under @Galileo-Galilei and @takikadiri personal accounts at the moment1.

Ehh... I still disagree with this, unless we take a different structure (i.e. other datasets in kedro-pandas, kedro-ibis, kedro-polars). I see no convincing argument thus far why ibis.TableDataset should be in a separate repo.

kedro-boot isn't a dataset. kedro-mlflow is a set of datasets, as well as an ecosystem, including plugin, around working with MLFlow. At the point we build a broader ecosystem around integrating Kedro and Ibis, it could make sense to move to a separate plugin.

The results of the survey are in and they are as follows:

Characteristics of regular datasets

1. Datasets that the Kedro TSC is willing to maintain: 100% agreement (13/13) 2. Datasets that are fully documented: 92% agreement (12/13) 3. Datasets that have working doctests (unless complex cloud/DB setup required): 85% agreement (11/13) 4. Datasets that are run as part of regular CI/CD jobs: 100% agreement (13/13) 5. Datasets that have 100% test coverage: 77% agreement (10/13) 6. Datasets that support all Python versions under NEP 29 (3.9+ currently): 8 agree (62%), 5 disagree (38%) 7. Datasets that work on MacOs, Linux and Windows: 7 agree (54%), 6 disagree (46%)

There's strong agreement on the first 5 points, and slight majority agreement on the last 2 points. With that, I propose to have the first 5 as "must-haves" for a regular datasets and the last 2 as "should-haves". This will be described clearly in the new contribution docs https://github.com/kedro-org/kedro-plugins/issues/579.

Characteristics of experimental datasets

1. Datasets that the Kedro TSC doesn’t want to maintain: 100% agreement (13/13) 2. Datasets that aren’t fully documented, but should have doc strings that explain the use of the dataset: 92% agreement (12/13) 3. Datasets that don’t run as part of the regular CI/CD jobs, meaning no tests/test coverage/doctests/Python version/OS system requirements: 85% agreement (11/13)

There's strong agreement on all these points and so they will be included as is in the new contribution docs https://github.com/kedro-org/kedro-plugins/issues/579.

Statements on process

Contribution

1. Any new contribution will be considered as an experimental contribution: 8 agree (62%), 5 disagree (38%)

Slight majority agreement, but this is quite a fundamental point of the new process so I suggest that every new contribution should be considered independently and the TSC decides if it's a regular or experimental contribution instead of making it experimental by default.

Graduation

1. Anyone (TSC members + users) can trigger a graduation process: 100% agreement (13/13) 2. We need 2/3 approval from the TSC to initiate a review/merge to the regular datasets space: 69% agreement (9/13) 3. A dataset can only graduate when it meets all requirements of a regular dataset: 92% agreement (12/13)

Demotion

1. The demotion process will be triggered by someone from the TSC when a dataset isn’t deemed fit as a regular contribution anymore. 92% agreement (12/13) 2. We need 2/3 approval from the TSC to initiate a review/merge to the experimental datasets space.: 69% agreement (9/13)

Majority agreement on all points, but there were several comments saying that 2/3 approval from the TSC is too rigid and hard to get increasing the difficulty of contributing. So instead, I'll change this to be a 1/2 approval from the TSC. All of that will be included in the new contribution docs

Still to discuss:

[ ] Should the author of an experimental dataset be mentioned on it? e.g. adding name/email in the class description? (I've looked into using CODEOWNERS for this, but it doesn't look like you can add CODEOWNERS per file.)
[ ] Dependency resolution requirements for regular and experimental datasets.

Thanks a lot for the summary @merelcht !

Datasets that work on MacOs, Linux and Windows: 7 agree (54%), 6 disagree (46%)

Are we setting any minimum requirements here, like "it must work on Linux"? Or would we consider datasets that, say, only work on Windows?

I think @deepyaman has voiced his opinion somewhere that Windows specifically shouldn't be a hard requirement, which I agree with. Maybe more clarity on this specific point would be helpful.

So instead, I'll change this to be a 1/2 approval from the TSC. All of that will be included in the new contribution docs

The switch to 1/2 TSC approval instead of 2/3 would be for demotion and graduation? (Including adding new datasets?)

I've looked into using CODEOWNERS for this, but it doesn't look like you can add CODEOWNERS per file.

Indeed all examples from https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners#codeowners-syntax are for wildcards or directories, but maybe it works for individual files?

Are we setting any minimum requirements here, like "it must work on Linux"? Or would we consider datasets that, say, only work on Windows?

I think @deepyaman has voiced his opinion somewhere that Windows specifically shouldn't be a hard requirement, which I agree with. Maybe more clarity on this specific point would be helpful.

Well yes, because we are saying a dataset should be able to run as part of CI, so that means it should at least work for one of the OS setups we have in CI. We can phrase it as "must" work on at least one of ... and ideally "should" work on all of them.