huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

load wikipedia failed when language is zh #5

Closed: pczzy closed this 1 year ago

pczzy commented 1 year ago

from datasets import load_dataset

dataset = load_dataset(
    "olm/wikipedia",
    cache_dir="/data0/zhangpeng6/hgface_cache",
    language="zh",
    date="20230101",
    num_proc=16,
)

"""
Traceback (most recent call last):
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/datasets/builder.py", line 1570, in _prepare_split_single
    for key, record in generator:
  File "/usr/home/zhangpeng6/.cache/huggingface/modules/datasets_modules/datasets/olm--wikipedia/dbfec0358f063ec7ae9e247d6559e2e505fbce7463e666024718863cbf199ec6/wikipedia.py", line 1032, in _generate_examples
    with Manager() as manager:
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/multiprocess/context.py", line 57, in Manager
    m.start()
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/multiprocess/managers.py", line 562, in start
    self._process.start()
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/multiprocess/process.py", line 118, in start
    assert not _current_process._config.get('daemon'), \
AssertionError: daemonic processes are not allowed to have children

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1342, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/data0/zhangpeng6/miniconda3/envs/torch2/lib/python3.10/site-packages/datasets/builder.py", line 1607, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""
TristanThrush commented 1 year ago

Hi there, thanks for raising this issue! Are you able to tell me:

  1. What OS you are using
  2. Whether other languages work for you, or if this is just an issue for zh
pczzy commented 1 year ago

OS: CentOS Linux release 7.9.2009 (Core)

Locale:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Python: 3.10.8

Other languages do not work for me either.

TristanThrush commented 1 year ago

Thanks. I'm at a Hugging Face offsite right now, but I will look into this soon.

TristanThrush commented 1 year ago

Hey, thanks for your patience!

I was able to reproduce your issue. The num_proc=16 argument you are passing was only added in one of the latest datasets releases, and it is incompatible with the olm/wikipedia loading script: the script already spawns its own child processes (the Manager() call in your traceback), and the workers created by num_proc are daemonic, so they are not allowed to have children. Just leave out num_proc=16 and let me know how it goes; the script will use all of your CPUs anyway, so you don't need to specify it anywhere.
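For reference, the same call without num_proc (paths and arguments taken from the original report):

from datasets import load_dataset

# Identical to the failing call, minus num_proc; the loading script
# manages its own process pool internally.
dataset = load_dataset(
    "olm/wikipedia",
    cache_dir="/data0/zhangpeng6/hgface_cache",
    language="zh",
    date="20230101",
)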

pczzy commented 1 year ago

Hey, that's great! Removing num_proc leads to all of my CPUs being used. Thanks a million.