pczzy closed 1 year ago
Hi there, thanks for raising this issue! Are you able to tell me:
OS: CentOS Linux release 7.9.2009 (Core)
Locale:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Python: 3.10.8
Other languages do not work for me
Thanks. I'm at a Hugging Face offsite right now, but will look into this soon.
Hey, thanks for your patience!
I was able to reproduce your issue. The num_proc=16 argument that you are using was only added in one of the latest datasets releases, and it is incompatible with the olm/wikipedia dataloader script. Just leave out num_proc=16 and let me know how it goes. The script will use all of your CPUs anyway, so you don't need to specify num_proc anywhere.
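For reference, here is a minimal sketch of the call without num_proc, assuming the Hugging Face datasets package is installed. The helpers strip_num_proc and load_wikipedia are hypothetical names used only for illustration; they are not part of the datasets library or the olm/wikipedia script.

```python
def strip_num_proc(kwargs):
    """Return a copy of kwargs without the num_proc key (hypothetical helper)."""
    return {key: value for key, value in kwargs.items() if key != "num_proc"}


def load_wikipedia(**kwargs):
    """Load olm/wikipedia, dropping num_proc before calling load_dataset."""
    # Imported lazily so the helper above can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset("olm/wikipedia", **strip_num_proc(kwargs))


# Example call (starts a large download and preprocessing job, so it is left commented out):
# dataset = load_wikipedia(
#     cache_dir="/data0/zhangpeng6/hgface_cache",
#     language="zh",
#     date="20230101",
#     num_proc=16,  # silently dropped by strip_num_proc
# )
```

Calling load_dataset directly with the same arguments and no num_proc should be equivalent; the wrapper only makes the removal explicit.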
Hey, that's great. Removing num_proc leads to all of my CPUs being used. Thanks a million.
dataset = load_dataset('olm/wikipedia',cache_dir="/data0/zhangpeng6/hgface_cache",language="zh", date="20230101",num_proc=16)