padeoe / hf-mirror-site

a huggingface mirror site.
https://hf-mirror.com

Some datasets fail to download with the latest datasets release when the mirror is set via an environment variable #22

Open yucc-leon opened 7 months ago

yucc-leon commented 7 months ago

Clean environment, Python 3.11, with only datasets (==2.18.0) installed.

from datasets import load_dataset
dataset = load_dataset("codeparrot/apps")

This raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1879, in dataset_module_factory
    raise e1 from None
  File "/home/xxx/miniforge3/envs/test/lib/python3.11/site-packages/datasets/load.py", line 1831, in dataset_module_factory
    can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
                                                                       ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

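The official developers could not reproduce this. Because the error occurs only while dataset information is being requested from the server, before any data files are read, I suspect the problem lies with the mirror (see the discussion in https://github.com/huggingface/datasets/issues/6760).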
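One observation that may help narrow this down: byte 0x8b at position 1 is what a gzip stream looks like when it is read as text, because gzip data starts with the magic bytes 0x1f 0x8b. The sketch below reproduces the same error message locally using only the standard library. Whether the mirror actually returned a still-compressed loading script is an assumption and is not confirmed in this thread; the payload and the helper name are made up for illustration.

```python
import gzip

# Hypothetical stand-in for a dataset loading script served by the hub.
script_text = 'DEFAULT_CONFIG_NAME = "all"\n'
compressed = gzip.compress(script_text.encode("utf-8"))


def describe_decode_failure(raw: bytes) -> str:
    """Try to decode raw bytes as UTF-8 and report what happens."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"byte {raw[err.start]:#04x} at position {err.start}: {err.reason}"
    return "decoded cleanly"


# gzip data always begins with the magic bytes 0x1f 0x8b.
print(compressed[:2].hex())
# Reading the compressed bytes as text fails on the second magic byte.
print(describe_decode_failure(compressed))
# After decompression the same content decodes without error.
print(describe_decode_failure(gzip.decompress(compressed)))
```

If this reading is right, the failing f.read() in load.py received compressed bytes where it expected plain text, which would be consistent with the error appearing before any data files are read.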

halfrot commented 7 months ago

+1 encounter the same issue here

padeoe commented 7 months ago

+1 encounter the same issue here

@halfrot Which version of huggingface_hub are you using? Please try the latest version; this issue should be fixed there.

halfrot commented 7 months ago

huggingface_hub is 0.22.2, which is the latest version.
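Since the replies compare several version combinations, a small helper for reporting them may be useful. This is a minimal sketch using only the standard library; the function name report_versions is hypothetical, and the package list simply mirrors the packages mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so the output can be pasted into a report.
for name, ver in report_versions(["datasets", "huggingface_hub", "transformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```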

alex-hek commented 7 months ago

I ran into a similar problem before. It may be an archive decoding issue; I solved it with pip install py7zr.

rangehow commented 5 months ago

Is there any new progress? @padeoe

My datasets version is 2.19.2 and my hub version is 2.23.0. I am loading the EleutherAI/drop dataset and hit the same problem.

padeoe commented 5 months ago

There has been some progress; I am still looking into it.

rangehow commented 5 months ago

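There has been some progress; I am still looking into it.

Thank you for your selfless contribution. Your work has been a great help to AI research in China.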

xuyuzhuang11 commented 5 months ago

pip install py7zr

This does not solve the problem.

xuyuzhuang11 commented 5 months ago

Please see https://github.com/huggingface/datasets/issues/6760. Switching datasets to 2.14.6 resolves it :-)
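For scripts that must run against the mirror, the workaround can be checked up front. The sketch below rests on two assumptions: that the mirror is selected through the HF_ENDPOINT environment variable (the issue title only says "environment variable"), and that every datasets version above 2.14.6 may be affected. The thread reports failures on specific newer releases but does not establish the exact boundary, so the check is deliberately conservative. The helper names are hypothetical.

```python
import os
import re

# Assumption: 2.14.6 is the newest version reported as working in this thread.
LAST_REPORTED_GOOD = (2, 14, 6)


def version_tuple(text):
    """Turn '2.18.0' or '3.0.1.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts[:3])


def may_be_affected(datasets_version, endpoint):
    """True when the mirror is in use and the version is newer than 2.14.6."""
    uses_mirror = "hf-mirror.com" in (endpoint or "")
    return uses_mirror and version_tuple(datasets_version) > LAST_REPORTED_GOOD


# Example check against the current environment, for the version in the report.
print(may_be_affected("2.18.0", os.environ.get("HF_ENDPOINT")))
```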

yucc-leon commented 5 months ago

Another error: when loading the code_search_net dataset, a timeout is still reported while the subsets are being downloaded and verified.

Downloading builder script: 8.44kB [00:00, 50.2MB/s]
Downloading readme: 12.9kB [00:00, 73.3MB/s]
Downloading data: 100%|███████████████████████████████████████████████████████████████████| 941M/941M [00:39<00:00, 23.6MB/s]
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
....
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:
......

ReadTimeout: HTTPSConnectionPool(host='cdn-lfs.hf-mirror.com', port=443): Read timed out. (read timeout=100.0)

Downgrading to 2.14.6 or lower also works around this problem for now.

VincentZ-2020 commented 2 months ago

So is this a Great Firewall problem? Downgrading fixes it for me too.

hl0737 commented 2 months ago

dataset = load_dataset("lighteval/MATH")

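I hit the same problem. Datasets that ship a loading script seem very likely to trigger it. Does anyone have a good solution? Thanks.

Environment: datasets 2.21.0, transformers 4.44.2, huggingface-hub 0.24.5

update: downgrading only datasets to 2.14.6 works around the problem for now.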

hl0737 commented 2 months ago

I ran into a similar problem before. It may be an archive decoding issue; I solved it with pip install py7zr.

This did not work for me.

Tomorrowdawn commented 1 month ago

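Still hitting this problem on 2024-10-01. Downgrading datasets to 2.14.6 works around it for now, but the trust_remote_code keyword then becomes incompatible (it seems to have been added in a newer version).

As datasets keeps being updated, the gap will probably keep widening. I hope this can be fixed soon, and I hope the information above helps.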
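One way to keep a single script working on both the downgraded and the current datasets release is to pass trust_remote_code only when the installed load_dataset declares it. The sketch below is generic and uses stand-in functions rather than the real library, so the two loaders shown are hypothetical. It also assumes the incompatibility is simply that the older signature lacks the keyword.

```python
import inspect


def declared_kwargs(func, **kwargs):
    """Keep only the keyword arguments that func names explicitly.

    Keywords that would merely be swallowed by **kwargs are dropped, since an
    older loader may forward unknown keywords elsewhere and fail later.
    """
    params = inspect.signature(func).parameters
    named = {
        name
        for name, p in params.items()
        if p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    return {key: value for key, value in kwargs.items() if key in named}


# Hypothetical stand-ins for an older and a newer loader signature.
def old_loader(path, **config_kwargs):
    return path, config_kwargs


def new_loader(path, trust_remote_code=None, **config_kwargs):
    return path, trust_remote_code, config_kwargs


print(declared_kwargs(old_loader, trust_remote_code=True))
print(declared_kwargs(new_loader, trust_remote_code=True))
```

With the real library the same idea would read load_dataset(path, **declared_kwargs(load_dataset, trust_remote_code=True)), though that usage has not been tested against the versions discussed here.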

hl0737 commented 1 month ago

The problem persists with datasets 3.0.1.

hl0737 commented 1 month ago

October 20: the problem persists. datasets 3.0.1 has updated the batch function, but 2.14.6 currently does not have this function.

hl0737 commented 2 weeks ago

update: newer versions of trl now enforce a minimum datasets version.