kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Snowflake dataset cannot be installed with latest pyarrow 14.0.1 #453

Closed: zencircle closed this issue 7 months ago

zencircle commented 10 months ago

Description

Unable to update to pyarrow>=14.0.1 to fix the vulnerability CVE-2023-47248: https://nvd.nist.gov/vuln/detail/CVE-2023-47248

Context

The pyarrow version constraint in the setup file blocks the new version from being installed: https://github.com/kedro-org/kedro/blob/main/setup.py#L47

Steps to Reproduce

The Docker build fails:

docker build -t local-kedro .

44.59     kedro[pandas] 0.18.2 depends on kedro 0.18.2 (from https://files.pythonhosted.org/packages/17/7f/48d5c36469bb39701b93d94155196c71a0e9eb6f9c7a1d701ceb1aa7d801/kedro-0.18.2-py3-none-any.whl (from https://pypi.org/simple/kedro/) (requires-python:>=3.7, <3.11))
44.59     The user requested pyarrow>=14.0.1
44.59     kedro[pandas] 0.18.1 depends on pyarrow<7.0 and >=1.0; extra == "pandas"
44.59     The user requested pyarrow>=14.0.1
44.59     kedro[pandas] 0.18.0 depends on pyarrow<7.0 and >=1.0; extra == "pandas"
44.59 
44.59 To fix this you could try to:
44.59 1. loosen the range of package versions you've specified
44.59 2. remove package versions to allow pip attempt to solve the dependency conflict
44.59 
44.59 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
------
Dockerfile:7
--------------------
   5 |     COPY /hmda-etl-pipeline/src/requirements.txt /tmp/requirements.txt
   6 |     RUN pip install --upgrade pip 
   7 | >>> RUN pip install --no-cache -r /tmp/requirements.txt && rm -f /tmp/requirements.txt
   8 |     RUN apt-get update
   9 |     RUN apt-get clean
--------------------
cat requirements.txt 
kedro
black
ruff
pandas
kedro[pandas]
kedro-viz
psycopg2-binary
sqlalchemy<2.0
ipykernel
pytest
pytest-cov
jupyter
requests>=2.31.0 #fix medium CVE
aiohttp
s3fs
certifi>=2023.07.22 #fix critical CVE
wheel>=0.38.1 # fix high
cryptography>=39.0.1  #fix high CVE
urllib3>=1.26.18 #fix high CVE
pyarrow>=14.0.1
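
The conflict pip reports above can be illustrated with a minimal sketch. It assumes the two constraints shown in the log (`pyarrow<7.0 and >=1.0` from `kedro[pandas]` 0.18.0/0.18.1, and `pyarrow>=14.0.1` from requirements.txt) and checks a few sample versions against both. The helpers `parse` and `in_range` are hypothetical and handle only plain numeric versions; this is not pip's resolver.

```python
from typing import Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn '14.0.1' into (14, 0, 1), padding to three components."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(version: str, lower: str, upper: Optional[str] = None) -> bool:
    """True if lower <= version < upper (no upper bound when upper is None)."""
    v = parse(version)
    if v < parse(lower):
        return False
    return upper is None or v < parse(upper)


# Sample versions for illustration; 6.0.1 and 14.0.1 appear in the report.
candidates = ["6.0.1", "8.0.0", "14.0.1"]

# kedro[pandas] 0.18.0/0.18.1 (from the log): pyarrow>=1.0,<7.0
allowed_by_kedro = [v for v in candidates if in_range(v, "1.0", "7.0")]
# requirements.txt: pyarrow>=14.0.1
allowed_by_user = [v for v in candidates if in_range(v, "14.0.1")]
# A resolver needs a version that satisfies both constraints at once.
allowed_by_both = [v for v in allowed_by_kedro if v in allowed_by_user]

print(allowed_by_kedro)  # ['6.0.1']
print(allowed_by_user)   # ['14.0.1']
print(allowed_by_both)   # []
```

Because the two ranges do not overlap, no pyarrow version can satisfy both, which matches the ResolutionImpossible error.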

Expected Result

Actual Result

Your Environment

kedro_docker@568c50ce860f:/$ python --version
Python 3.9.18
kedro_docker@568c50ce860f:/$ kedro -V
kedro, version 0.18.14

If I do not specify pyarrow, the Docker image builds successfully with pyarrow 6:

pip list | grep pyarrow
pyarrow                       6.0.1

astrojuanlu commented 10 months ago

Hi @zencircle, thanks for reporting. The develop branch has already dropped the datasets (kedro-org/kedro#2126), so this will be fixed in the upcoming 0.19.0 release.

Those datasets moved to kedro-datasets, and all the upper bounds on pyarrow have been lifted there, except the one for snowflake: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/setup.py

I'm moving this to the relevant repo.

astrojuanlu commented 9 months ago

If I understand correctly, we're still pinning the pyarrow version:

https://github.com/kedro-org/kedro-plugins/blob/94019fd6e724af9cbacfc7b7fb9678e897a435d2/kedro-datasets/setup.py#L88-L91

felipemonroy commented 8 months ago

Hi, do we know why we require pyarrow to install the SnowparkTableDataset, which only uses snowpark? I was trying to install the dataset on Python 3.11; however, pyarrow~=8.0 is only available up to Python 3.10.
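
For readers unfamiliar with the `~=` operator mentioned here: under PEP 440, a compatible-release clause such as `~=8.0` is equivalent to `>=8.0, ==8.*`, so it also rules out pyarrow 14.0.1. The following is a minimal sketch of that rule, assuming plain numeric versions only; `compatible_release_bounds` and `satisfies_compatible` are hypothetical helpers, not part of pip or Kedro.

```python
from typing import Tuple


def pad(parts: list) -> Tuple[int, ...]:
    """Pad a list of ints to three components so tuples compare cleanly."""
    return tuple(parts + [0] * (3 - len(parts)))


def compatible_release_bounds(spec: str) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Return (lower, upper) such that '~=spec' means lower <= v < upper."""
    parts = [int(p) for p in spec.split(".")]
    if len(parts) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = parts[:-1]   # drop the last component ...
    prefix[-1] += 1       # ... and bump the one before it
    return pad(parts), pad(prefix)


def satisfies_compatible(version: str, spec: str) -> bool:
    """Check a plain numeric version against a '~=' clause."""
    lower, upper = compatible_release_bounds(spec)
    v = pad([int(p) for p in version.split(".")])
    return lower <= v < upper


print(compatible_release_bounds("8.0"))       # ((8, 0, 0), (9, 0, 0))
print(satisfies_compatible("8.0.1", "8.0"))   # True
print(satisfies_compatible("14.0.1", "8.0"))  # False
```

With this pin in place, any environment that needs pyarrow>=14.0.1 conflicts with the dataset's requirement in the same way as the original report.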

astrojuanlu commented 8 months ago

It was introduced in #148 but I see no rationale for it. @deepyaman do you remember?

Otherwise we could try removing that dependency. @felipemonroy would you like to send a pull request?

merelcht commented 7 months ago

Fixed in https://github.com/kedro-org/kedro-plugins/pull/538