Open adampauls opened 1 year ago
Sorry, I just found https://github.com/apache/beam/issues/24458. It seems this issue is being worked on.
Reopening, since I think the docs should inform the user of this problem. For example, this page says "Datasets is tested on Python 3.7+." but it should probably say that Beam Datasets do not work with Python 3.10 (or link to a known issues page).
Same problem on Colab using a vanilla setup running: Python 3.10.11, apache-beam 2.47.0, datasets 2.12.0
Same problem: py 3.10.11, apache-beam==2.47.0, datasets==2.12.0
I have made a workaround by forcing an install of multiprocess version 0.70.15 (after installing datasets and apache-beam). I can confirm that (on Python 3.10 in this colab notebook) datasets can download pre-processed Wikipedia dumps and can download non-pre-processed dumps using beam_runner="DirectRunner". I don't know if/how other beam_runners can be made compatible.
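A minimal sketch of a check for the workaround above, assuming (based only on that comment) that multiprocess 0.70.15 is the version to aim for on Python 3.10. The helper names parse_version, installed_multiprocess and workaround_applied are hypothetical, and the pip command in the printed hint is the usual way to force a specific version, not something spelled out in this thread.

```python
import sys
from importlib import metadata

# Version reported as working in the comment above; treating it as a
# minimum is an assumption, not something the thread confirms.
WORKING_MULTIPROCESS = (0, 70, 15)


def parse_version(text):
    """Turn '0.70.15' into (0, 70, 15), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_multiprocess():
    """Return the installed multiprocess version string, or None if absent."""
    try:
        return metadata.version("multiprocess")
    except metadata.PackageNotFoundError:
        return None


def workaround_applied(installed):
    """True if the given version string is at least the reported working version."""
    return installed is not None and parse_version(installed) >= WORKING_MULTIPROCESS


if __name__ == "__main__":
    found = installed_multiprocess()
    print("Python", sys.version_info[:2], "multiprocess", found)
    if not workaround_applied(found):
        # Hypothetical hint; run it after installing datasets and apache-beam.
        print("Try: pip install multiprocess==0.70.15")
```

The check only compares version numbers; it does not verify that a Beam runner other than DirectRunner works.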
Same problem.
python = "^3.10"
apache-beam = { extras = ["gcp"], version = "2.54.0" }
datasets = "^2.18.0"
Describe the bug
Grabbing the latest version of datasets and apache-beam with poetry using Python 3.10 gives a crash at runtime. The crash is
I think this is a bad interaction of versions from dill, multiprocess, apache-beam, and threading from the Python (3.10) standard lib. Upgrading multiprocess to a version that does not crash like this is not possible because apache-beam pins dill to an old version:
Perhaps it is not right to file a bug here, but I'm not totally sure whose fault it is. And in any case, this is an immediate blocker to using datasets out of the box.
Possibly related to https://github.com/huggingface/datasets/issues/5232.
Steps to reproduce the bug
Steps to reproduce:
1. Make a poetry project with this configuration
2. poetry install.
3. poetry run python -c "import datasets".
Expected behavior
Script runs.
Environment info
Python 3.10. Here are the versions installed by poetry:
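A minimal sketch of how such a version list could be collected from inside the environment (for example via poetry run python), assuming the four packages named in the bug description are the relevant ones. The names PACKAGES, installed_versions and format_report are hypothetical.

```python
from importlib import metadata

# Packages named in the bug description; adjust as needed.
PACKAGES = ["datasets", "apache-beam", "dill", "multiprocess"]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render one line per package in pip-style name==version form."""
    lines = []
    for name, version in versions.items():
        if version is None:
            lines.append(f"{name} (not installed)")
        else:
            lines.append(f"{name}=={version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(installed_versions()))
```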