togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

ArXiv cleaning issue #68

Closed hicotton02 closed 11 months ago

hicotton02 commented 1 year ago

I was able to download the content from S3 (it shows 181 GB) and then tried to run the ./run_clean.py script. I get thousands of errors like this one:

[2023-08-02T22:28:40.123948][ERROR] UnicodeDecodeError: ~/Documents/ai_data/RedPajama-Data/data_prep/arxiv/work/329f2d6d-b1f1-48f6-ac00-c42769cdb1ef__e515y_o/tmp0t76cedh/0809/0809.0966.gz

and then the stack trace:

Traceback (most recent call last):
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 125, in <module>
    main()
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 116, in main
    run_clean(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 67, in run_clean
    arxiv_cleaner.run_parallel(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 60, in run_parallel
    for record, arxiv_id in executor.map(
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 766, in map
    results = super().map(partial(_process_chunk, fn),
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 190, in _get_chunks
    chunk = tuple(itertools.islice(it, chunksize))
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 146, in arxiv_iterator
    with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
  File "/usr/lib/python3.10/tempfile.py", line 1008, in __exit__
    self.cleanup()
  File "/usr/lib/python3.10/tempfile.py", line 1012, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "/usr/lib/python3.10/tempfile.py", line 994, in _rmtree
    _rmtree(name, onerror=onerror)
  File "/usr/lib/python3.10/shutil.py", line 725, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.10/shutil.py", line 664, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 662, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '1012'

I have already re-downloaded the data once, but because of the cost I don't want to try again without reaching out first.

mauriceweber commented 1 year ago

Hi @hicotton02 !

A certain number of UnicodeDecodeError messages is expected: they are caught whenever a .tex file contains characters that cannot be decoded as UTF-8. In any case, let me know if the majority of documents can't be processed because of this error.
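To illustrate what "caught" means here, below is a minimal sketch of the general pattern, not the actual code in arxiv_cleaner.py. The names `decode_tex`, `failure_rate`, and the in-memory sample files are hypothetical; the log format only loosely imitates the error line quoted above. It decodes each file as UTF-8, logs and skips the ones that fail, and reports what fraction failed, which is the number worth checking.

```python
import logging

logging.basicConfig(format="[%(asctime)s][%(levelname)s] %(message)s")
logger = logging.getLogger("arxiv_clean_sketch")  # hypothetical logger name


def decode_tex(raw: bytes, name: str):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Log and skip, so one badly encoded file does not stop the whole run.
        logger.error("UnicodeDecodeError: %s", name)
        return None


def failure_rate(files: dict) -> float:
    """Fraction of files that could not be decoded as UTF-8."""
    if not files:
        return 0.0
    failed = sum(decode_tex(raw, name) is None for name, raw in files.items())
    return failed / len(files)


# Hypothetical in-memory stand-ins for .tex files.
samples = {
    "ok.tex": "caf\u00e9".encode("utf-8"),
    "latin1.tex": "caf\u00e9".encode("latin-1"),  # Latin-1 bytes, invalid as UTF-8
}
rate = failure_rate(samples)
print(f"{rate:.0%} of files failed to decode")
```

If a check like this shows only a small share of files failing, the errors are probably the expected kind; a share above one half would match the "majority of documents" case.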

Regarding the stack trace, can you show me the command and arguments you're using to run the script?