togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

how to process arXiv tex files without downloading? #51

Closed irene622 closed 1 year ago

irene622 commented 1 year ago

I downloaded the arXiv tex files myself, without running scripts/arxiv-kickoff-download.sh.

My data is structured as follows:

my_arxiv_src
 |- papername1
      |- name.tex
 |- papername2
      |- name.tex
      |- other_name.tex

I want to preprocess my LaTeX data, so I run bash scripts/arxiv-kickoff-cleaning.sh. My arxiv-kickoff-cleaning.sh looks like this:

#!/bin/bash

set -e

WORKERS=2

# load modules
module load gcc/10.2.0 cuda/11.4 cudampi/openmpi-4.1.1 conda/pytorch_1.12.0
pip install -r arxiv_requirements.txt

export DATA_DIR="./my_arxiv_src"
export TARGET_DIR="./data/arxiv/processed"
export WORK_DIR="./work"

mkdir -p logs/arxiv/cleaning

# setup partitions
python run_clean.py --data_dir "$DATA_DIR" --target_dir "$TARGET_DIR" --workers $WORKERS --setup

# run cleaning in job array
sbatch scripts/arxiv-clean-slurm.sbatch

arxiv-kickoff-cleaning.sh runs without errors, but the result files, arxiv_1.jsonl and arxiv_2.jsonl, are empty.

What should DATA_DIR and TARGET_DIR point to? Is there a way to run the cleaning directly on LaTeX files?

mauriceweber commented 1 year ago

Hi @irene622, thanks for your question!

The run_clean.py script expects your data to be organized in the same way as when it is downloaded from the arXiv S3 bucket. The most straightforward way is thus probably to simply mirror this structure. In this case you will need to package your my_arxiv_src into a tar file and store it in data/src. So you need something like this:

data/src
|-my_arxiv_src.tar
    |-papername1.gz
        |- name.tex
    |-papername2.gz
        |-name.tex
        |-other_name.tex

You can then call python run_clean.py --data_dir data/src --target_dir /dir/to/desired/output. Also check out this part in the code of the arxiv_cleaner module: https://github.com/togethercomputer/RedPajama-Data/blob/d174968e57a0a515982766771d7b805a8b1bc83a/data_prep/arxiv/arxiv_cleaner.py#L131-L177
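
If it helps, here is a minimal sketch of how the layout above could be produced from a my_arxiv_src directory. It is not part of the repository: the function name package_arxiv_src and the output file name are made up for illustration, and it assumes that each papername.gz should be a gzip-compressed tar holding that paper's .tex files, similar to what the arXiv bulk source download provides. Please check that assumption against the linked arxiv_cleaner code before relying on it.

```python
import io
import tarfile
from pathlib import Path


def package_arxiv_src(src_dir: Path, out_tar: Path) -> int:
    """Pack every paper directory in src_dir into <paper>.gz inside one outer tar.

    Hypothetical helper, not part of RedPajama-Data.
    Returns the number of papers that were packaged.
    """
    out_tar.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(out_tar, "w") as outer:
        for paper_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
            tex_files = sorted(paper_dir.rglob("*.tex"))
            if not tex_files:
                continue  # skip directories without any .tex file

            # Assumption: each paper is stored as a gzip-compressed tar.
            buf = io.BytesIO()
            with tarfile.open(fileobj=buf, mode="w:gz") as inner:
                for tex in tex_files:
                    inner.add(tex, arcname=tex.relative_to(paper_dir).as_posix())
            data = buf.getvalue()

            # Add the in-memory archive to the outer tar as <paper>.gz.
            info = tarfile.TarInfo(name=f"{paper_dir.name}.gz")
            info.size = len(data)
            info.mtime = int(paper_dir.stat().st_mtime)
            outer.addfile(info, io.BytesIO(data))
            count += 1
    return count
```

Calling package_arxiv_src(Path("my_arxiv_src"), Path("data/src/my_arxiv_src.tar")) should then give you a tar file in data/src that you can pass to run_clean.py via --data_dir data/src.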

Let me know if this helps!