prnake / CialloCorpus

People's Daily (1946-2023) and Xi Jinping important speech series database

Error when downloading from Hugging Face #1

Open conkeur opened 8 months ago

conkeur commented 8 months ago

Code used for the download:

from datasets import load_dataset
dataset_name = "Papersnake/people_daily_news"
dataset = load_dataset(dataset_name, cache_dir=r'xxx/')

Error message:

An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 2 missing columns ({'author', 'page'})

This happened while the json dataset builder was generating data using

..\downloads\d434406d0e80132d996bc6796817699b81390d86744e10acda0ec2ea71fead71

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback (most recent call last):
  File "_pydevd_bundle/pydevd_cython.pyx", line 546, in _pydevd_bundle.pydevd_cython.PyDBFrame._handle_exception
  File "C:\Program Files\Python39\lib\linecache.py", line 26, in getline
    def getline(filename, lineno, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 36, in getlines
    def getlines(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 80, in updatecache
    def updatecache(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\codecs.py", line 319, in decode
    def decode(self, input, final=False):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 41: invalid start byte
0.03s - Error on build_exception_info_response.
Traceback (most recent call last):
  File "c:\program files\microsoft visual studio\2022\community\common7\ide\extensions\microsoft\python\core\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 1404, in build_exception_info_response
    def build_exception_info_response(dbg, thread_id, request_seq, set_additional_thread_info, iter_visible_frames_info, max_frames):
  File "C:\Program Files\Python39\lib\linecache.py", line 26, in getline
    def getline(filename, lineno, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 36, in getlines
    def getlines(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 80, in updatecache
    def updatecache(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\codecs.py", line 319, in decode
    def decode(self, input, final=False):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 41: invalid start byte

I opened the corresponding file, and its contents are: {"url": "hf://datasets/Papersnake/people_daily_news@e61323bc7692312d907fc2d154b4ffc4290ce496/2004.jsonl.gz", "etag": null}

prnake commented 8 months ago

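The jsonl files for different years are not guaranteed to share the same format, so I recommend downloading the data and processing it manually. For example, after installing git lfs, run git clone https://huggingface.co/datasets/Papersnake/people_daily_news to download the data.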
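
As a rough illustration of what manual processing could look like once the repository is cloned, here is a minimal sketch. It assumes the clone contains one gzip-compressed JSON Lines file per year (such as the 2004.jsonl.gz named in the cache entry above) and that the only mismatch is missing keys such as 'author' and 'page'. The helper names read_jsonl_gz and load_normalized are hypothetical and are not part of this repository or of the datasets library.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_normalized(repo_dir):
    """Read every *.jsonl.gz directly under repo_dir and give all records the same keys.

    Assumption: the files sit at the top level of the cloned directory.
    Everything is held in memory, which may be too much for the full corpus.
    """
    records = []
    for path in sorted(Path(repo_dir).glob("*.jsonl.gz")):
        records.extend(read_jsonl_gz(path))
    # Union of all keys seen, so columns such as 'author' and 'page' exist
    # on every record; years that lack them get None instead.
    columns = sorted({key for record in records for key in record})
    return [{col: record.get(col) for col in columns} for record in records]
```

Calling load_normalized("people_daily_news") on the cloned directory would then return a list of dicts with identical keys. Whether that list can be handed straight to a dataset library depends on the value types also being consistent across years, which this sketch does not check.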

conkeur commented 8 months ago

OK, I'll give it a try.