How to maintain external datasets contributions

kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.

Apache License 2.0


How to maintain external datasets contributions #535

Open noklam opened 1 year ago

noklam commented 1 year ago

Description

Why is this raised?

With more incoming dataset PRs, it becomes harder to maintain all the datasets. Particularly for the exotic datasets, we don't have the setup for every possible environment (e.g. Snowflake/Databricks). This creates a challenge for maintaining all the datasets since we don't have the re

This also leads to the question "Does every dataset belong in kedro-datasets?"

The answer is no, since there are already a few popular datasets maintained separately, for example in kedro-mlflow.

Possible Action

CSVDataSet is more robust than, say, ManagedTableDataSet. Can we signal this better through our docs? We did something similar for the Deployment docs.
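One way to picture this kind of signalling is a per-dataset support label that the docs could render as a table. The sketch below is only an illustration: the `SupportTier` enum, the tier names, and `render_tier_table` are hypothetical and not part of Kedro or kedro-datasets, and the tier assigned to each dataset simply mirrors the comparison made above.

```python
from enum import Enum


class SupportTier(Enum):
    """Hypothetical support levels; not an existing Kedro concept."""

    CORE = "core"
    COMMUNITY = "community"


# Illustrative assignment only, mirroring the robustness comparison above.
DATASET_TIERS = {
    "CSVDataSet": SupportTier.CORE,
    "ManagedTableDataSet": SupportTier.COMMUNITY,
}


def render_tier_table(tiers):
    """Render a plain-text table of dataset names and their support tiers."""
    width = max(len(name) for name in tiers)
    lines = [f"{'Dataset'.ljust(width)}  Tier"]
    for name in sorted(tiers):
        lines.append(f"{name.ljust(width)}  {tiers[name].value}")
    return "\n".join(lines)


print(render_tier_table(DATASET_TIERS))
```

Keeping the labels in one mapping would let the docs build generate the table, rather than relying on each dataset's docstring to state its own support level.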

More Discussion

How do we want to maintain the contributions? How do we draw the line between something that should be a separate plugin and something that goes into kedro-datasets? Cc @astrojuanlu

Idea raised during retro:

Datasets could be maintained as separate plugins, e.g. kedro-mlflow has its own datasets.

noklam commented 8 months ago

Link: https://github.com/kedro-org/kedro-plugins/issues/517#issuecomment-1911991500

Maybe we can close this ticket?

astrojuanlu commented 8 months ago

kedro-org/kedro#517 was a different (although related) discussion. In the middle of it though, I raised the question "Should we accept every dataset that is in good shape in kedro-datasets?" and the answer seemed to be yes. However, this was at the very end of our meeting and there was not nearly enough time to weigh the pros and cons of this.

So I'd say we keep it open.

Having said that though, there are a number of pull requests open already, and I think it's unfair that we hold them up because of a lack of firm consensus on this topic.

astrojuanlu commented 8 months ago

For example, consider discoverability. The current monorepo approach already hinders the visibility of the individual plugins, as described in https://github.com/kedro-org/kedro-plugins/issues/401

For datasets inside kedro-datasets, the effect is even larger. On top of that, the actual business logic of custom datasets is hidden behind private methods that don't get documented by default https://github.com/kedro-org/kedro/issues/1936#issuecomment-1717214883
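To illustrate the point about private methods, here is a minimal sketch. It uses a stand-in base class rather than Kedro's real one, and `AbstractDatasetSketch`, `InMemoryTextDataset`, and `public_api` are hypothetical names made up for this example. It shows how a documentation tool that skips underscore-prefixed members only sees the generic `load`/`save` wrappers, while the dataset-specific logic lives in `_load`/`_save`.

```python
import inspect


class AbstractDatasetSketch:
    """Stand-in base class for illustration; NOT Kedro's implementation."""

    def load(self):
        """Load data by delegating to the subclass's ``_load``."""
        return self._load()

    def save(self, data):
        """Save data by delegating to the subclass's ``_save``."""
        self._save(data)


class InMemoryTextDataset(AbstractDatasetSketch):
    """Hypothetical dataset that keeps a single string in memory."""

    def __init__(self):
        self._value = ""

    def _load(self):
        """Return the stored string (the dataset-specific logic)."""
        return self._value

    def _save(self, data):
        """Store the string after stripping surrounding whitespace."""
        self._value = data.strip()


def public_api(cls):
    """List the methods a tool that skips '_'-prefixed members would show."""
    return sorted(
        name
        for name, _member in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    )


print(public_api(InMemoryTextDataset))
```

In this sketch the whitespace-stripping behaviour is described only in the `_save` docstring, so a reader of the generated API page would never learn about it unless private members are explicitly included.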

astrojuanlu commented 8 months ago

(And this is aside from the maintenance issues @noklam mentioned)

astrojuanlu commented 8 months ago

I think we are underestimating the maintenance burden of the current approach.

Lots of people in the team have trouble building the docs locally, because one has to install all the dependencies of all datasets for that to work. @rashidakanchwala can attest - she struggled a lot, and now I'm unable to do it myself (troubleshooting some weird conflicts raised by pip).

On the other hand, there have been users in the past that have been confused and couldn't even run the test suite. It happened for https://github.com/kedro-org/kedro-plugins/pull/360 and also for https://github.com/kedro-org/kedro-plugins/pull/435.

I think it's time to seriously consider breaking kedro-datasets apart.

datajoely commented 8 months ago

I do keep wondering if we could have a low-code dataset contribution workflow on the website that allowed us to accept contributions and manage the test suite for users.

astrojuanlu commented 7 months ago

A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict https://github.com/kedro-org/kedro-plugins/issues/597#issuecomment-1981138414

lrcouto commented 5 months ago

> A user literally ran out of disk space when trying to install kedro-datasets test dependencies while troubleshooting a pip conflict #597 (comment)

This happened to me this week while running tests to figure out the issues with the kedro-datasets dependencies 😬