ArXiv cleaning issue - Githubissues

I was able to download the content from S3 (it shows 181 GB) and attempted to run the ./run_clean.py script. I get thousands of errors like this one:

[2023-08-02T22:28:40.123948][ERROR] UnicodeDecodeError: ~/Documents/ai_data/RedPajama-Data/data_prep/arxiv/work/329f2d6d-b1f1-48f6-ac00-c42769cdb1ef__e515y_o/tmp0t76cedh/0809/0809.0966.gz

and then the stack trace:

Traceback (most recent call last):
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 125, in <module>
    main()
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 116, in main
    run_clean(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/run_clean.py", line 67, in run_clean
    arxiv_cleaner.run_parallel(
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 60, in run_parallel
    for record, arxiv_id in executor.map(
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 766, in map
    results = super().map(partial(_process_chunk, fn),
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 610, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 190, in _get_chunks
    chunk = tuple(itertools.islice(it, chunksize))
  File "/home/theskaz/Documents/ai_data/RedPajama-Data/data_prep/arxiv/arxiv_cleaner.py", line 146, in arxiv_iterator
    with tempfile.TemporaryDirectory(dir=self._work_dir) as tmpdir:
  File "/usr/lib/python3.10/tempfile.py", line 1008, in __exit__
    self.cleanup()
  File "/usr/lib/python3.10/tempfile.py", line 1012, in cleanup
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
  File "/usr/lib/python3.10/tempfile.py", line 994, in _rmtree
    _rmtree(name, onerror=onerror)
  File "/usr/lib/python3.10/shutil.py", line 725, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.10/shutil.py", line 664, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 662, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: '1012'

I have already re-downloaded the data once, but because of the cost I don't want to try again without reaching out first.

togethercomputer / RedPajama-Data

ArXiv cleaning issue #68