huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Version mismatch with multiprocess and dill on Python 3.10 #5613

Open adampauls opened 1 year ago

adampauls commented 1 year ago

Describe the bug

Installing the latest versions of datasets and apache-beam with poetry on Python 3.10 crashes as soon as datasets is imported. The traceback is:

File "/Users/adpauls/sc/git/DSI-transformers/data/NQ/create_NQ_train_vali.py", line 1, in <module>
    import datasets
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/__init__.py", line 43, in <module>
    from .arrow_dataset import Dataset
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 65, in <module>
    from .arrow_reader import ArrowReader
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 30, in <module>
    from .download.download_config import DownloadConfig
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/__init__.py", line 9, in <module>
    from .download_manager import DownloadManager, DownloadMode
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/download_manager.py", line 35, in <module>
    from ..utils.py_utils import NestedDataStructure, map_nested, size_str
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 40, in <module>
    import multiprocess.pool
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 609, in <module>
    class ThreadPool(Pool):
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 611, in ThreadPool
    from .dummy import Process
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/dummy/__init__.py", line 87, in <module>
    class Condition(threading._Condition):
AttributeError: module 'threading' has no attribute '_Condition'. Did you mean: 'Condition'?
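A quick way to check an environment for the condition this traceback points at is sketched below. It is only a diagnostic under stated assumptions: the package list is taken from this issue, and the helper names `installed_versions` and `has_private_condition` are hypothetical, not part of datasets or multiprocess.

```python
import threading
from importlib import metadata

# Packages named in this issue; the list is illustrative, not exhaustive.
PACKAGES = ["datasets", "apache-beam", "multiprocess", "dill"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_private_condition():
    """True only if this interpreter exposes threading._Condition."""
    return hasattr(threading, "_Condition")


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
    # The traceback above fails because this attribute is missing on Python 3.10.
    print("threading._Condition present:", has_private_condition())
```

If the last line reports the attribute as missing and multiprocess is at an old version, the environment is likely to hit the same AttributeError on `import datasets`.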

I think this is a bad interaction between the versions of dill, multiprocess, and apache-beam and the threading module from the Python 3.10 standard library. Upgrading multiprocess to a version that does not crash like this is not possible, because apache-beam pins dill to an old version:

Because multiprocess (0.70.10) depends on dill (>=0.3.2)
 and apache-beam (2.45.0) depends on dill (>=0.3.1.1,<0.3.2), multiprocess (0.70.10) is incompatible with apache-beam (2.45.0).
And because no versions of apache-beam match >2.45.0,<3.0.0, multiprocess (0.70.10) is incompatible with apache-beam (>=2.45.0,<3.0.0).
So, because yyy depends on both apache-beam (^2.45.0) and multiprocess (0.70.10), version solving failed.
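The conflict in the solver output can be reproduced by hand. The sketch below uses only the two dill constraints quoted above; the candidate version list is illustrative rather than an authoritative list of dill releases, and the parser is a simplification that handles only plain numeric versions (it is not a full PEP 440 implementation).

```python
def parse(version):
    """Turn '0.3.1.1' into (0, 3, 1, 1), zero-padded to four components."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def allowed_by_multiprocess(v):
    # multiprocess 0.70.10 requires dill >=0.3.2 (from the solver output).
    return parse(v) >= parse("0.3.2")


def allowed_by_beam(v):
    # apache-beam 2.45.0 requires dill >=0.3.1.1,<0.3.2 (from the solver output).
    return parse("0.3.1.1") <= parse(v) < parse("0.3.2")


# Illustrative candidates only; assumed for the example.
CANDIDATES = ["0.3.1", "0.3.1.1", "0.3.2", "0.3.3", "0.3.4", "0.3.5.1", "0.3.6"]

# Versions acceptable to both packages at once.
both = [v for v in CANDIDATES if allowed_by_multiprocess(v) and allowed_by_beam(v)]

if __name__ == "__main__":
    print("satisfies both constraints:", both)
```

No candidate can pass both checks, since one range starts at 0.3.2 and the other ends just below it, which is why the solver gives up.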

Perhaps it is not right to file a bug here, but I'm not totally sure whose fault it is. And in any case, this is an immediate blocker to using datasets out of the box.

Possibly related to https://github.com/huggingface/datasets/issues/5232.

Steps to reproduce the bug

Steps to reproduce:

  1. Make a poetry project with this configuration

    [tool.poetry]
    name = "yyy"
    version = "0.1.0"
    description = ""
    authors = ["Adam Pauls <adpauls@gmail.com>"]
    readme = "README.md" 
    packages = [{ include = "xxx" }]
    
    [tool.poetry.dependencies]   
    python = ">=3.10,<3.11"
    datasets = "^2.10.1"
    apache-beam = "^2.45.0"
    
    [build-system]
    requires = ["poetry-core"]  
    build-backend = "poetry.core.masonry.api"

  2. poetry install.
  3. poetry run python -c "import datasets".

Expected behavior

Script runs.

Environment info

Python 3.10. Here are the versions installed by poetry:

  • Installing frozenlist (1.3.3)
  • Installing idna (3.4)
  • Installing multidict (6.0.4)
  • Installing aiosignal (1.3.1)
  • Installing async-timeout (4.0.2)
  • Installing attrs (22.2.0)
  • Installing certifi (2022.12.7)
  • Installing charset-normalizer (3.1.0)
  • Installing six (1.16.0)
  • Installing urllib3 (1.26.14)
  • Installing yarl (1.8.2)
  • Installing aiohttp (3.8.4)
  • Installing dill (0.3.1.1)
  • Installing docopt (0.6.2)
  • Installing filelock (3.9.0)
  • Installing numpy (1.22.4)
  • Installing pyparsing (3.0.9)
  • Installing protobuf (3.19.4)
  • Installing packaging (23.0)
  • Installing python-dateutil (2.8.2)
  • Installing pytz (2022.7.1)
  • Installing pyyaml (6.0)
  • Installing requests (2.28.2)
  • Installing tqdm (4.65.0)
  • Installing typing-extensions (4.5.0)
  • Installing cloudpickle (2.2.1)
  • Installing crcmod (1.7)
  • Installing fastavro (1.7.2)
  • Installing fasteners (0.18)
  • Installing fsspec (2023.3.0)
  • Installing grpcio (1.51.3)
  • Installing hdfs (2.7.0)
  • Installing httplib2 (0.20.4)
  • Installing huggingface-hub (0.12.1)
  • Installing multiprocess (0.70.9)
  • Installing objsize (0.6.1)
  • Installing orjson (3.8.7)
  • Installing pandas (1.5.3)
  • Installing proto-plus (1.22.2)
  • Installing pyarrow (9.0.0)
  • Installing pydot (1.4.2)
  • Installing pymongo (3.13.0)
  • Installing regex (2022.10.31)
  • Installing responses (0.18.0)
  • Installing xxhash (3.2.0)
  • Installing zstandard (0.20.0)
  • Installing apache-beam (2.45.0)
  • Installing datasets (2.10.1)

adampauls commented 1 year ago

Sorry, I just found https://github.com/apache/beam/issues/24458. It seems this issue is being worked on.

adampauls commented 1 year ago

Reopening, since I think the docs should inform the user of this problem. For example, this page says

Datasets is tested on Python 3.7+.

but it should probably say that Beam Datasets do not work with Python 3.10 (or link to a known issues page).

jeromemassot commented 1 year ago

Same problem on Colab using a vanilla setup running: Python 3.10.11, apache-beam 2.47.0, datasets 2.12.0

sergesteban commented 1 year ago

Same problem: Python 3.10.11, apache-beam==2.47.0, datasets==2.12.0

boyleconnor commented 1 year ago

I have made a workaround by forcing an install of multiprocess version 0.70.15 (after installing datasets and apache-beam). I can confirm that (on Python 3.10 in this colab notebook) datasets can download pre-processed Wikipedia dumps and can download non-pre-processed dumps using beam_runner="DirectRunner". I don't know if/how other beam_runners can be made compatible.
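A minimal sketch of the forced install described above, assuming pip is available for the current interpreter. The helper names `pip_install_command` and `force_install` are hypothetical, and the script defaults to a dry run so nothing is installed unless asked.

```python
import shlex
import subprocess
import sys


def pip_install_command(package, version):
    """Build a pip command that pins one package, using the current interpreter."""
    return [sys.executable, "-m", "pip", "install", f"{package}=={version}"]


def force_install(package, version, dry_run=True):
    """Print the pip command, and run it only when dry_run is False."""
    cmd = pip_install_command(package, version)
    if dry_run:
        print("would run:", shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Intended to run after datasets and apache-beam are already installed.
    # Pass dry_run=False to actually perform the install.
    force_install("multiprocess", "0.70.15")
```

pip may warn about a conflict with the dill pin from apache-beam when this runs for real; whether the resulting environment works with runners other than DirectRunner is not established in this thread.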

axelmagn commented 7 months ago

Same problem.

python = "^3.10"
apache-beam = { extras = ["gcp"], version = "2.54.0" }
datasets = "^2.18.0"