huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.96k stars 2.62k forks source link

Dataset with streaming doesn't work with proxy #6992

Open YHL04 opened 2 months ago

YHL04 commented 2 months ago

Describe the bug

I'm currently trying to stream data using dataset since the dataset is too big but it hangs indefinitely without loading the first batch. I use AIMOS which is a supercomputer that uses proxy to connect to the internet. I assume it has to do with the network configurations. I've already set up both HTTP_PROXY and HTTPS_PROXY. streaming = False works fine.

Steps to reproduce the bug

use load_dataset with streaming = True in AIMOS

Expected behavior

does not hang indefinitely and loads batches to start training run

Environment info

_libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge _pytorch_select 2.0 cuda_2 https://ftp.osuosl.org/pub/open-ce/1.10.0 abseil-cpp 20220623.0 h9888cd1_6 conda-forge absl-py 1.0.0 py311h399429b_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 aiofiles 23.2.1 pyhd8ed1ab_0 conda-forge aiohttp 3.8.6 py311hf118e41_0 aiosignal 1.2.0 pyhd3eb1b0_0 archspec 0.2.3 pyhd8ed1ab_0 conda-forge arrow-cpp 11.0.0 ha3edaa6_5_cpu conda-forge async-timeout 4.0.2 py311h6ffa863_0 attrs 23.1.0 py311h6ffa863_0 av 10.0.0 py311he6153ed_2 https://ftp.osuosl.org/pub/open-ce/1.10.0 aws-c-auth 0.6.24 hb81f6d7_5 conda-forge aws-c-cal 0.5.20 h3c2b4d9_6 conda-forge aws-c-common 0.8.11 h4194056_0 conda-forge aws-c-compression 0.2.16 ha19333d_3 conda-forge aws-c-event-stream 0.2.18 h12a9399_6 conda-forge aws-c-http 0.7.4 ha2cde00_2 conda-forge aws-c-io 0.13.17 h9189062_2 conda-forge aws-c-mqtt 0.8.6 h40d1a04_6 conda-forge aws-c-s3 0.2.4 hbdbe4f0_3 conda-forge aws-c-sdkutils 0.1.7 ha19333d_3 conda-forge aws-checksums 0.1.14 ha19333d_3 conda-forge aws-crt-cpp 0.19.7 hd018011_7 conda-forge aws-sdk-cpp 1.10.57 hb9575ba_4 conda-forge blas 1.0 openblas blinker 1.8.2 pyhd8ed1ab_0 conda-forge boltons 23.0.0 py311h6ffa863_0 boost-cpp 1.82.0 h25e6d66_2 bottleneck 1.3.5 py311h34f6284_0 brotli 1.0.9 hf118e41_7 brotli-bin 1.0.9 hf118e41_7 brotli-python 1.0.9 py311h4a02239_7 bzip2 1.0.8 h7b6447c_0 c-ares 1.19.1 hf118e41_0 ca-certificates 2024.6.2 h0f6029e_0 conda-forge cachetools 5.3.3 pyhd8ed1ab_0 conda-forge certifi 2024.6.2 pyhd8ed1ab_0 conda-forge cffi 1.15.1 py311hf118e41_3 charset-normalizer 2.0.4 pyhd3eb1b0_0 click 8.1.7 unix_pyh707e725_0 conda-forge conda 24.5.0 py311h1af927a_0 conda-forge conda-content-trust 0.2.0 py311h6ffa863_0 conda-libmamba-solver 23.11.1 py311h6ffa863_0 conda-package-handling 2.2.0 py311h6ffa863_0 conda-package-streaming 0.9.0 py311h6ffa863_0 contourpy 1.0.5 py311h25e6d66_0 cryptography 41.0.3 py311hb0e80e7_0 cudatoolkit 11.8.0 hedcfb66_13 conda-forge cudnn 8.9.2_11.8 h9ceb136_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 cycler 0.11.0 pyhd3eb1b0_0 datasets 2.12.0 py311h6ffa863_0 dill 0.3.6 py311h6ffa863_0 distro 1.9.0 pyhd8ed1ab_0 conda-forge ffmpeg 4.2.2 opence_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 filelock 3.9.0 py311h6ffa863_0 fmt 9.1.0 h25e6d66_0 fonttools 4.25.0 pyhd3eb1b0_0 freetype 2.12.1 hd23a775_0 frozendict 2.4.4 py311hb02d432_0 conda-forge frozenlist 1.4.0 py311hf118e41_0 fsspec 2023.9.2 py311h6ffa863_0 gflags 2.2.2 he6710b0_0 giflib 5.2.1 hf118e41_3 glog 0.6.0 hbe088e0_0 conda-forge gmp 6.3.0 h46f38da_0 conda-forge gmpy2 2.1.5 py311h2758da7_1 conda-forge google-auth 2.30.0 pyhff2d567_0 conda-forge google-auth-oauthlib 0.5.3 pyhd8ed1ab_0 conda-forge grpc-cpp 1.51.1 h8ba971d_1 conda-forge grpcio 1.54.3 py311h414e0d3_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 huggingface_hub 0.17.3 py311h6ffa863_0 icu 73.1 h4a02239_0 idna 3.4 py311h6ffa863_0 importlib-metadata 6.0.0 py311h6ffa863_0 jinja2 3.1.4 pyhd8ed1ab_0 conda-forge jpeg 9e hf118e41_1 jsonpatch 1.32 pyhd3eb1b0_0 jsonpointer 2.1 pyhd3eb1b0_0 kiwisolver 1.4.4 py311h4a02239_0 krb5 1.20.1 hc019ccd_1 lame 3.100 hb283c62_1003 conda-forge lcms2 2.12 h2045e0b_0 ld_impl_linux-ppc64le 2.38 hec883e6_1 lerc 3.0 h29c3540_0 leveldb 1.23 h24532b4_1 conda-forge libabseil 20220623.0 cxx17_h9235812_6 conda-forge libarchive 3.6.2 hd8ab008_2 libarrow 11.0.0 h837770b_5_cpu conda-forge libboost 1.82.0 haf51a6a_2 libbrotlicommon 1.0.9 hf118e41_7 libbrotlidec 1.0.9 hf118e41_7 libbrotlienc 1.0.9 hf118e41_7 libcrc32c 1.1.2 h3b9df90_0 conda-forge libcurl 8.4.0 h4d62439_0 libdeflate 1.17 hf118e41_1 libedit 3.1.20221030 hf118e41_0 libev 4.33 h140841e_1 libevent 2.1.10 h19c23f1_4 conda-forge libexpat 2.6.2 h46f38da_0 conda-forge libffi 3.4.4 h4a02239_0 libgcc-ng 13.2.0 h31e42bb_10 conda-forge libgfortran-ng 11.2.0 hb3889a9_1 libgfortran5 11.2.0 h1234567_1 libgomp 13.2.0 h31e42bb_10 conda-forge libgoogle-cloud 2.7.0 h11140b6_1 conda-forge libgrpc 1.51.1 h4d29a31_1 conda-forge libmamba 1.5.3 h7c6fafd_0 libmambapy 1.5.3 py311h828bf7b_0 libnghttp2 1.57.0 h44e5816_0 libnsl 2.0.1 ha17a0cc_0 conda-forge libopenblas 0.3.23 hc5a31fb_2 https://ftp.osuosl.org/pub/open-ce/1.10.0 libopus 1.3.1 h4e0d66e_1 conda-forge libpng 1.6.39 hf118e41_0 libprotobuf 3.21.12 h1776448_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 libsolv 0.7.24 h0f529ac_0 libsqlite 3.45.3 hd4bbf49_0 conda-forge libssh2 1.10.0 h50fa78f_2 libstdcxx-ng 13.2.0 h262982c_10 conda-forge libthrift 0.18.0 h82f1162_0 conda-forge libtiff 4.5.1 h4a02239_0 libutf8proc 2.8.0 hb283c62_0 conda-forge libuuid 2.38.1 h4194056_0 conda-forge libvpx 1.13.1 h46f38da_0 conda-forge libwebp 1.3.2 h0f96ee2_0 libwebp-base 1.3.2 hf118e41_0 libxcrypt 4.4.36 ha17a0cc_1 conda-forge libxml2 2.10.4 h18e3229_1 libzlib 1.2.13 h1f2b957_6 conda-forge llvm-openmp 14.0.6 hc028133_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 lmdb 0.9.31 ha17a0cc_1 conda-forge lz4-c 1.9.4 h4a02239_0 markdown 3.4.4 pyhd8ed1ab_0 conda-forge markupsafe 2.1.5 py311h32d8acf_0 conda-forge matplotlib 3.8.0 py311h6ffa863_0 matplotlib-base 3.8.0 py311h52e1fcc_0 menuinst 2.1.1 py311h1af927a_0 conda-forge mpc 1.3.1 heaf1863_0 conda-forge mpfr 4.2.1 haad2271_1 conda-forge mpmath 1.3.0 pyhd8ed1ab_0 conda-forge multidict 6.0.2 py311hf118e41_0 multiprocess 0.70.14 py311h6ffa863_0 munkres 1.1.4 py_0 mypy_extensions 1.0.0 pyha770c72_0 conda-forge nccl 2.18.3 cuda11.8_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 ncurses 6.4 h4a02239_0 nest-asyncio 1.6.0 pyhd8ed1ab_0 conda-forge networkx 2.8.8 pyhd8ed1ab_0 conda-forge nomkl 3.0 0 https://ftp.osuosl.org/pub/open-ce/1.10.0 numactl 2.0.16 hba61f60_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 numexpr 2.8.7 py311hc46fc55_0 numpy 1.24.3 py311h148a09e_0 numpy-base 1.24.3 py311h06b82f6_0 oauthlib 3.2.2 pyhd8ed1ab_0 conda-forge openjpeg 2.4.0 hfe35807_0 openssl 3.3.1 h1f2b957_0 conda-forge orc 1.8.2 h341c9a4_2 conda-forge packaging 23.1 py311h6ffa863_0 pandas 2.1.1 py311h52e1fcc_0 pcre2 10.42 h280155c_0 pillow 10.0.1 py311he33076b_0 pip 23.3 py311h6ffa863_0 platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge pluggy 1.0.0 py311h6ffa863_1 pooch 1.8.2 pyhd8ed1ab_0 conda-forge protobuf 4.21.12 py311ha7baec7_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 psutil 5.9.8 py311hd26027c_0 conda-forge pyarrow 11.0.0 py311h04a18d5_1 pyasn1 0.6.0 pyhd8ed1ab_0 conda-forge pyasn1-modules 0.4.0 pyhd8ed1ab_0 conda-forge pybind11-abi 4 hd3eb1b0_1 pycosat 0.6.6 py311hf118e41_0 pycparser 2.21 pyhd3eb1b0_0 pyjwt 2.8.0 pyhd8ed1ab_1 conda-forge pyopenssl 23.2.0 py311h6ffa863_0 pyparsing 3.0.9 py311h6ffa863_0 pyre-extensions 0.0.30 pyhd8ed1ab_0 conda-forge pysocks 1.7.1 py311h6ffa863_0 python 3.11.8 h3332dee_0_cpython conda-forge python-dateutil 2.8.2 pyhd3eb1b0_0 python-tzdata 2023.3 pyhd3eb1b0_0 python-xxhash 2.0.2 py311hf118e41_1 python_abi 3.11 4_cp311 conda-forge pytorch 2.0.1 cuda11.8_py311_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 pytorch-base 2.0.1 cuda11.8_py311_pb4.21.12_4 https://ftp.osuosl.org/pub/open-ce/1.10.0 pytz 2023.3.post1 py311h6ffa863_0 pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge pyyaml 6.0.1 py311hf118e41_0 re2 2023.02.01 h883269e_0 conda-forge readline 8.2 hf118e41_0 regex 2023.10.3 py311hf118e41_0 reproc 14.2.4 h29c3540_1 reproc-cpp 14.2.4 h29c3540_1 requests 2.31.0 py311h6ffa863_0 requests-oauthlib 2.0.0 pyhd8ed1ab_0 conda-forge responses 0.13.3 pyhd3eb1b0_0 rsa 4.9 pyhd8ed1ab_0 conda-forge ruamel.yaml 0.17.21 py311hf118e41_0 s2n 1.3.37 h5e47323_0 conda-forge safetensors 0.4.0 py311hda16d9e_0 scipy 1.11.1 py311hd69e9bb_0 https://ftp.osuosl.org/pub/open-ce/1.10.0 sentencepiece 0.1.97 h1e74c73_py311_pb4.21.12_2 https://ftp.osuosl.org/pub/open-ce/1.10.0 setuptools 68.0.0 py311h6ffa863_0 six 1.16.0 pyhd3eb1b0_1 snappy 1.1.9 h29c3540_0 sqlite 3.41.2 hf118e41_0 sympy 1.12.1 pypyh2585a3b_103 conda-forge tabulate 0.8.10 pyhd8ed1ab_0 conda-forge tensorboard 2.13.0 pyhab0730d_pb4.21.12_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 tensorboard-data-server 0.7.0 pyh6f84499_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 tensorboard-plugin-wit 1.6.0 pyh9f0ad1d_0 conda-forge tk 8.6.13 hd4bbf49_0 conda-forge tokenizers 0.13.3 py311h3d4f45a_0 torchdata 0.6.0 py311_2 https://ftp.osuosl.org/pub/open-ce/1.10.0 torchsnapshot 0.1.0 pyhd8ed1ab_0 conda-forge torchtext-base 0.15.2 cuda11.8_py311_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 torchtnt 0.2.4 pyhd8ed1ab_0 conda-forge torchvision-base 0.15.2 cuda11.8_py311_1 https://ftp.osuosl.org/pub/open-ce/1.10.0 tornado 6.3.3 py311hf118e41_0 tqdm 4.65.0 py311h7837921_0 transformers 4.32.1 py311h6ffa863_0 truststore 0.8.0 py311h6ffa863_0 typing-extensions 4.7.1 py311h6ffa863_0 typing_extensions 4.7.1 py311h6ffa863_0 typing_inspect 0.9.0 pyhd8ed1ab_0 conda-forge tzdata 2023c h04d1e81_0 urllib3 1.26.18 py311h6ffa863_0 utf8proc 2.6.1 h140841e_0 werkzeug 2.3.8 pyhd8ed1ab_0 conda-forge wheel 0.41.2 py311h6ffa863_0 xxhash 0.8.0 h140841e_3 xz 5.4.2 hf118e41_0 yaml 0.2.5 h7b6447c_0 yaml-cpp 0.8.0 h4a02239_0 yarl 1.8.1 py311hf118e41_0 zipp 3.11.0 py311h6ffa863_0 zlib 1.2.13 h1f2b957_6 conda-forge zstandard 0.19.0 py311hf118e41_0 zstd 1.5.5 h57e4825_0

lhoestq commented 2 months ago

Hi ! can you try updating datasets and huggingface_hub ?

pip install -U datasets huggingface_hub