kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

[Spike] Decide new name for `kedro-datasets` optional dependencies #313

Closed astrojuanlu closed 6 months ago

astrojuanlu commented 1 year ago

The root cause of the problems we had last week (see #306) with the optional dependency group names in pyproject.toml is that kedro-datasets extras names just don't align with Python packaging standards: https://packaging.python.org/en/latest/specifications/core-metadata/#provides-extra-multiple-use

Provides-Extra (multiple use)
Changed in version 2.3: PEP 685 restricted valid values to be unambiguous (i.e. no normalization required). ... A string containing the name of an optional feature. A valid name consists only of lowercase ASCII letters, ASCII numbers, and hyphen.

Therefore, pip install kedro-datasets[pandas.CSVDataSet] is out of line with current packaging standards.

From https://peps.python.org/pep-0685/:

When comparing extra names, tools MUST normalize the names being compared using the semantics outlined in PEP 503 for names: `re.sub(r"[-_.]+", "-", name).lower()`
The core metadata specification will be updated such that the allowed names for Provides-Extra matches what PEP 508 specifies for names.
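To make the failure mode concrete, here is a minimal sketch of the PEP 503/685 normalization quoted above, applied to one of our extras names (the function name is mine, not from any spec):

```python
import re

def normalize_extra(name: str) -> str:
    """PEP 503-style normalization, as PEP 685 mandates for extra names."""
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize_extra("pandas.CSVDataSet"))  # -> pandas-csvdataset
```

So the name a user writes (`pandas.CSVDataSet`) and the name tools compare against (`pandas-csvdataset`) are no longer the same string, which is exactly why the dotted-CamelCase extras break.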

So, why was this working before? See also in the PEP:

Tools generating metadata MUST raise an error if a user specified two or more extra names which would normalize to the same name.

The reason is: setuptools hasn't fully implemented PEP 685 yet: https://github.com/pypa/setuptools/issues/3586

So, our mangling of extras in setup.py before https://github.com/kedro-org/kedro-plugins/pull/263 worked, but after transitioning to self-referential extras, we fell victim to name normalization.
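For illustration, the duplicate-detection rule quoted from the PEP ("tools MUST raise an error if a user specified two or more extra names which would normalize to the same name") can be sketched like this; the helper names and the example pair of extras are hypothetical, not taken from kedro-datasets:

```python
import re

def normalize_extra(name: str) -> str:
    """PEP 503-style normalization, as PEP 685 mandates for extra names."""
    return re.sub(r"[-_.]+", "-", name).lower()

def check_extras(declared):
    """Raise if two declared extras would normalize to the same name (PEP 685)."""
    seen = {}
    for extra in declared:
        norm = normalize_extra(extra)
        if norm in seen:
            raise ValueError(
                f"extras {seen[norm]!r} and {extra!r} both normalize to {norm!r}"
            )
        seen[norm] = extra

# A hypothetical pair that pre-PEP-685 setuptools would accept,
# but that a compliant tool must reject:
# check_extras(["pandas.CSVDataSet", "pandas-csvdataset"])  # ValueError
```

A fully PEP 685-compliant setuptools would have rejected our dotted names up front instead of silently normalizing them, which is why the problem only surfaced at install time.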

I'm voting against switching to Poetry, Hatch, PDM, or any other system that allows this behavior, because it's going to bite us in the future. We need a short-term solution (either revert #263 or go for @DimedS #307) and a long-term solution (possibly deprecating extras names with dots and offering an alternative syntax for our users).

Originally posted by @astrojuanlu in https://github.com/kedro-org/kedro-plugins/issues/306#issuecomment-1680094495

astrojuanlu commented 10 months ago

Note that this is starting to crop up as we dynamically include requirements.txt in projects' pyproject.toml. This happened to me yesterday after installing a wheel produced by `kedro package`:

❯ pip install ./spaceflights-namespaces/dist/spaceflights_namespaces-0.1-py3-none-any.whl  (advkedro310)
Looking in indexes: https://juan_luis_cano%40mckinsey.com:****@mckinsey.jfrog.io/artifactory/api/pypi/python/simple
Processing ./spaceflights-namespaces/dist/spaceflights_namespaces-0.1-py3-none-any.whl
...
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-csvdataset'
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-exceldataset'
WARNING: kedro-datasets 1.3.0 does not provide the extra 'pandas-parquetdataset'

Contents of requirements.txt:

kedro-datasets[pandas.CSVDataSet, pandas.ExcelDataSet, pandas.ParquetDataSet]~=1.0

(coming from https://github.com/kedro-org/kedro-starters/blob/38e81fba49702d0c8851ad4dcd488bafd08f1a36/spaceflights/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/requirements.txt#L8)
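The warnings above are pip comparing the normalized requested extras against the extras the package actually declares. A rough sketch of that check, with an assumed (illustrative, not exhaustive) set of declared extras:

```python
import re

def normalize_extra(name: str) -> str:
    """PEP 503-style normalization, as PEP 685 mandates for extra names."""
    return re.sub(r"[-_.]+", "-", name).lower()

def missing_extras(requested, provided):
    """Return normalized requested extras that the package does not declare."""
    return [normalize_extra(e) for e in requested if normalize_extra(e) not in provided]

# Illustrative subset of extras; the real declared names differ per release.
provided = {"pandas", "spark", "all"}
requested = ["pandas.CSVDataSet", "pandas.ExcelDataSet", "pandas.ParquetDataSet"]
for extra in missing_extras(requested, provided):
    print(f"WARNING: kedro-datasets 1.3.0 does not provide the extra '{extra}'")
```

Since `pandas-csvdataset` etc. never appear in the package metadata, pip warns and installs none of the per-dataset dependencies.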

astrojuanlu commented 9 months ago

Another user was confused by this: https://linen-slack.kedro.org/t/16176249/requirements-txt-line-11-3-8-2-warning-kedro-datasets-2-0-0-#72d04177-c93b-4199-8853-da3d59aa52e6

astrojuanlu commented 7 months ago

From pip 23.3 onwards, extras with illegal names are no longer pulled: https://github.com/kedro-org/kedro-plugins/issues/553

This was already marked as high priority, but now it's affecting users in practice cc @noklam @ankatiyar

astrojuanlu commented 7 months ago

The workaround for now is to tell users to install, say, kedro-datasets[pandas] instead of kedro-datasets[pandas.ParquetDataset], since the former is a legal name and the latter isn't. But this pulls in lots of unneeded dependencies.

noklam commented 7 months ago

Sorry for the delayed response, and thank you for looking into this. Since you've been able to recreate the problem, it sounds like you no longer need me to post the stack trace. I've been using Pipenv to manage my environments. I'm actually a bit confused right now... I just created a new Kedro project. Pipenv sees the requirements.txt file and uses it. Previously this didn't seem to install pyarrow or fastparquet, and pandas displayed a warning that one of those two packages was going to become a non-optional requirement. But my current project did in fact install pyarrow.

Another user had a problem. At first he hit a bug where fastparquet was installed but ParquetDataset wouldn't work, failing with a "file is closed" error. That is a legitimate bug in itself.

But pyarrow should be the industry standard by now. He only reached this error because `pip install kedro-datasets[optional]` is missing the dependencies, and the resulting error message suggests installing fastparquet or pyarrow.

astrojuanlu commented 7 months ago

The key decision here is: do we still want to provide a per-dataset optional dependency group?

If yes, we have to decide on a naming scheme. If no, that's a breaking change (kedro-datasets 3.0?), but the current non-standard names don't work anyway. Besides, the dataset groups already have legal names, so those don't need to change (kedro-datasets[pandas] etc.).

astrojuanlu commented 6 months ago

From the Airflow 2.8.2 release notes: https://github.com/apache/airflow/releases/tag/2.8.2

Airflow extras now get extras normalized to - (following PEP-685) instead of _ and . (as it was before in some extras). When you install airflow with such extras (for example dbt.core or all_dbs) you should use - instead of _ and ..

https://github.com/apache/airflow/pull/36537

This is exactly our problem.

astrojuanlu commented 6 months ago

Fixed by #570, will see the light in the next release 👍🏽