kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Package `kedro.extras.datasets` into its own `kedro-datasets` package #1457

Closed idanov closed 10 months ago

idanov commented 2 years ago

Problem

Currently Kedro as a framework and kedro.extras.datasets are one package and this has caused a number of issues:

Staged approach

Ripping kedro.extras.datasets from the framework is itself a breaking change, but we can phase this out in two stages:

  1. Make a separate package and alias everything in kedro.extras.datasets to that new package, including installing kedro[xxx]
  2. When we ship 0.19, we can officially remove kedro.extras.datasets and now both kedro and kedro-datasets will be installed separately
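
Stage 1 could be sketched roughly like this: the old module location stays importable, but it only forwards attribute lookups to the new package and emits a deprecation warning. This is a minimal illustration under assumptions, not Kedro's actual implementation. `new_pkg`, `old_pkg_extras_datasets` and the dummy `CSVDataSet` class are stand-ins built in memory so the snippet runs without Kedro installed.

```python
import sys
import types
import warnings

# Stand-in for the new, separately installed package (hypothetical name).
new_pkg = types.ModuleType("new_pkg")


class CSVDataSet:
    """Dummy dataset class used only for this illustration."""


new_pkg.CSVDataSet = CSVDataSet
sys.modules["new_pkg"] = new_pkg

# Stand-in for the old location, which keeps existing during stage 1.
old_pkg = types.ModuleType("old_pkg_extras_datasets")


def _forward(name):
    """Module-level __getattr__ (PEP 562): forward lookups to the new package."""
    try:
        value = getattr(sys.modules["new_pkg"], name)
    except AttributeError:
        raise AttributeError(
            f"module {old_pkg.__name__!r} has no attribute {name!r}"
        ) from None
    warnings.warn(
        f"{old_pkg.__name__}.{name} has moved to new_pkg.{name}",
        DeprecationWarning,
        stacklevel=2,
    )
    return value


old_pkg.__getattr__ = _forward
sys.modules[old_pkg.__name__] = old_pkg

# Importing from the old location still works, but a warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from old_pkg_extras_datasets import CSVDataSet as OldCSVDataSet
```

In stage 2 the forwarding module would simply be deleted, at which point the old import path fails and only the new package name works.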

Tasks for first stage (0.18.X)

Tasks for second stage (preparing release 0.19)

antonymilne commented 2 years ago

This is great! 🚀 Just wanted to add another motivation for doing this, and also some more tasks that will be needed. Datasets and their dependencies are the number 1 cause of random CI failures, which have slowed us down quite a lot. I presume that your intention was that all the dataset tests should move to kedro-datasets also? If so then I'd add to "tasks for first stage":

Also I think we should have:

One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

merelcht commented 2 years ago

> One small question: were you thinking of kedro-datasets living in kedro-plugins or a whole new repo? It's not a plugin in the strict sense of the word since it doesn't use entrypoints, but maybe it's easier to maintain if it's in the monorepo.

I like this suggestion! True, kedro-datasets isn't technically a plugin, but if we add it to kedro-plugins we can make use of the already existing CI setup and add in automatic push to PyPI for all plugins in one go.

One more thing we should do as part of the "never work in kedro.extras.datasets" task is to make sure this is clear to open source contributors by adding it to the contribution guide, at the top of every dataset class, and perhaps also in the PR template.

noklam commented 2 years ago

Follow up on some unresolved issues discussed in https://github.com/kedro-org/kedro-plugins/pull/38

  1. How will docs work for kedro-datasets? We will need to generate documentation (at least the dataset part) from the kedro-datasets repo.
  2. Deepyaman has an idea with namespace packages (i.e. we keep the namespace kedro.extras.datasets in the kedro-datasets repo).

idanov commented 1 year ago

Here's just a wild idea: I've played a bit with Python's sys.meta_path and came up with an interesting PoC for aliasing imports globally, as shown here:

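import sys
import importlib.util
from importlib.abc import MetaPathFinder

class Custom(MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        # Only intercept imports whose name mentions kedro.datasets
        if 'kedro.datasets' not in fullname:
            return None
        # Resolve the import to the installed kedro_datasets package instead
        return importlib.util.find_spec('kedro_datasets')

# Register an instance so the import system consults this finder
sys.meta_path.append(Custom())

import kedro.datasets as kd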

The code here allows you to import anything from kedro_datasets as if it were kedro.datasets. I did not explore it enough for potential problems, but maybe we can investigate whether we can somehow make the alias kedro.datasets work out of the box for any Kedro project? WDYT?

The modification of the meta_path can happen in configure_project so it works in all entrypoints. I am not sure how it will work with language servers or IDE support though...
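
To make the idea a bit more concrete, here is a slightly more defensive sketch of such an alias finder. It matches only the alias and its submodules, and registers the real module objects under the alias names so both import paths share one object. Everything named here is an assumption for illustration: `AliasFinder`, `AliasLoader`, and the in-memory stand-in modules `demo_framework` and `demo_datasets` are hypothetical and are not part of Kedro. Side effects on module metadata, IDEs and language servers remain unexplored.

```python
import importlib
import importlib.abc
import importlib.util
import sys
import types


class AliasLoader(importlib.abc.Loader):
    """Loader that hands back an existing module instead of creating a new one."""

    def __init__(self, real_name):
        self.real_name = real_name

    def create_module(self, spec):
        # Import the real module so the alias and the real name share one object.
        return importlib.import_module(self.real_name)

    def exec_module(self, module):
        # Nothing to run: importing the real module already executed it.
        pass


class AliasFinder(importlib.abc.MetaPathFinder):
    """Map `alias` and `alias.<sub>` onto `real` and `real.<sub>`."""

    def __init__(self, alias, real):
        self.alias = alias
        self.real = real

    def find_spec(self, fullname, path, target=None):
        if fullname != self.alias and not fullname.startswith(self.alias + "."):
            return None
        real_name = self.real + fullname[len(self.alias):]
        return importlib.util.spec_from_loader(fullname, AliasLoader(real_name))


# In-memory stand-ins (hypothetical names) so the sketch runs without Kedro.
framework = types.ModuleType("demo_framework")
framework.__path__ = []  # an empty __path__ marks the stand-in as a package
real_pkg = types.ModuleType("demo_datasets")
real_pkg.__path__ = []
real_sub = types.ModuleType("demo_datasets.pandas")
real_sub.CSVDataSet = type("CSVDataSet", (), {})
real_pkg.pandas = real_sub
sys.modules.update({
    "demo_framework": framework,
    "demo_datasets": real_pkg,
    "demo_datasets.pandas": real_sub,
})

# Put the finder first so that alias submodules are not loaded a second time
# from the real package's __path__ by the default path-based finder.
sys.meta_path.insert(0, AliasFinder("demo_framework.datasets", "demo_datasets"))

import demo_framework.datasets.pandas as aliased
from demo_datasets.pandas import CSVDataSet
```

With this in place, `aliased.CSVDataSet` and `CSVDataSet` refer to the same class, which avoids the `isinstance` surprises that arise when one module is loaded twice under two names.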

deepyaman commented 1 year ago

@idanov That's very interesting, and I didn't know about this functionality. ~That being said, if we're going to investigate this, I think we should prioritize spending some time to make it work using namespace packages, because that would ideally be the "standard" solution to the problem (that doesn't involve path hacking).~ I just saw some of the discussion around why namespace packages may not work in https://github.com/kedro-org/kedro/issues/1758, and even though I don't fully understand all of the reasons having not spent time on it (except for some of the challenges with pip install -e), this could be a good alternative.
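
For readers unfamiliar with the namespace-package alternative mentioned here, the following sketch shows the general mechanism (PEP 420 implicit namespace packages) using throwaway directories. The names `demo_ns`, `core` and `datasets` are hypothetical and the layout is not Kedro's. Note that a regular package (a directory with `__init__.py`) takes precedence over namespace portions, so this only works if no portion ships an `__init__.py` at the shared level.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two separate "distributions", each shipping one portion of the same namespace.
root = Path(tempfile.mkdtemp())
for dist, module, body in [
    ("dist_framework", "core", "NAME = 'framework portion'\n"),
    ("dist_datasets", "datasets", "NAME = 'datasets portion'\n"),
]:
    pkg_dir = root / dist / "demo_ns"
    pkg_dir.mkdir(parents=True)
    # No __init__.py: that is what makes demo_ns an implicit namespace package.
    (pkg_dir / f"{module}.py").write_text(body)
    sys.path.append(str(root / dist))

importlib.invalidate_caches()

import demo_ns.core
import demo_ns.datasets

# Both portions are reachable under one top-level name.
portions = list(demo_ns.__path__)
```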

merelcht commented 10 months ago

Closing this because all sub-tasks are done and Kedro 0.19.0 and kedro-datasets 2.0.0 will be released soon.