karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Dataset load #89

Open thremilien opened 1 year ago

thremilien commented 1 year ago

Hello, I have an issue while loading my dataset in prepare.py (for openwebtext). The download and the extraction complete successfully, but generating the train split raises an error.

I've already tried looking for the file 0180327-a95f1342cd685fb7d22805aa720870d2.txt in the archive and adding it manually to the extracted dataset, but that doesn't work. ignore_verifications is set to False.

If you need more information, I can give you whatever you need.

Thanks for your help

Config :


Computing checksums of downloaded files. They can be used for integrity verification. You can disable this by passing ignore_verifications=True to load_dataset
Computing checksums: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.82s/it]
C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\download_manager.py:431: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20610/20610 [05:27<00:00, 62.85it/s]
Generating train split:   0%|▋                                                                                                                                                | 35271/8013769 [01:43<2:24:57, 917.33 examples/s]
Traceback (most recent call last):
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1570, in _prepare_split_single
    for key, record in generator:
  File "C:\Users\emili\.cache\huggingface\modules\datasets_modules\datasets\openwebtext\85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1\openwebtext.py", line 85, in _generate_examples
    with open(filepath, encoding="utf-8") as f:
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\streaming.py", line 69, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\download\streaming_download_manager.py", line 445, in xopen
    return open(main_hop, mode, *args, **kwargs)
OSError: [Errno 22] Invalid argument: 'C:\\Users\\emili\\.cache\\huggingface\\datasets\\downloads\\extracted\\85b7a70ee547a4372aa7cf8fab0e93cd8849e09e1cba8454c1d113746400e918\\0180327-a95f1342cd685fb7d22805aa720870d2.txt'    

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\emili\Desktop\nanoGPT\data\openwebtext\prepare.py", line 15, in <module>
    dataset = load_dataset("openwebtext")
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1611, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 953, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1449, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\emili\Desktop\nanoGPT\venv\lib\site-packages\datasets\builder.py", line 1606, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

zjsuper commented 1 year ago

Same problem, any ideas?

Coriana commented 1 year ago

It looks like Windows is deleting files that contain JS shellcode exploits, which causes the load to fail.
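
One rough way to check this is to walk the extracted directory and try to open every file, collecting the ones that fail. The sketch below is only an illustration: `find_unreadable_files` is a hypothetical helper, not part of nanoGPT or `datasets`, and it assumes the affected files still show up in the directory listing but raise an `OSError` when opened, as in the traceback above.

```python
import os
from pathlib import Path


def find_unreadable_files(root, suffix=".txt"):
    """Return (path, error) pairs for files under `root` that cannot be opened.

    Hypothetical diagnostic helper: it only tries to open each file and read
    one byte; it does not inspect or repair the contents.
    """
    unreadable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = Path(dirpath) / name
            try:
                # Binary mode avoids decoding errors; only OS-level
                # failures (missing, blocked, or locked files) matter here.
                with open(path, "rb") as f:
                    f.read(1)
            except OSError as exc:
                unreadable.append((str(path), exc))
    return unreadable


# Example usage (the path is an assumption based on the traceback above;
# adjust it to your own Hugging Face cache location):
# extracted = Path.home() / ".cache" / "huggingface" / "datasets" / "downloads" / "extracted"
# for path, err in find_unreadable_files(extracted):
#     print(path, "->", err)
```

If this reports files while antivirus protection is on and none once the folder is excluded or protection is off, that would support the explanation above. It would not prove it, since other causes can also produce `OSError`.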

patrobadri commented 1 year ago

Same problem, any idea?

zjsuper commented 1 year ago

Setting num_proc = 1 and turning off all Windows "Virus & threat protection" and "Firewall & network protection" settings solved the problem.

hanfluid commented 1 year ago

Same issue here.