Closed — irene622 closed this issue 1 year ago
Hi @irene622 , thanks for your question!
The `run_clean.py` script expects your data to be organized in the same way as when it is downloaded from the arXiv S3 bucket. The most straightforward approach is thus probably to simply mirror this structure. In this case you will need to package your `my_arxiv_src` files into a tar file and store it in `data/src`. So you need something like this:
data/src
|- my_arxiv_src.tar
   |- papername1.gz
      |- name.tex
   |- papername2.gz
      |- name.tex
      |- other_name.tex
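The layout above can be produced programmatically. Below is a minimal sketch of packing local `.tex` sources into that structure; `pack_arxiv_src` and the sample paper names are hypothetical, and it assumes the arXiv S3 convention of one `.gz` member per paper (a gzipped `.tex` for single-file submissions, a gzipped tar for multi-file ones) — check the `arxiv_cleaner` code linked below for how the members are actually read.

```python
import gzip
import io
import os
import tarfile

def pack_arxiv_src(papers, out_tar):
    """Pack LaTeX sources into an arXiv-S3-style tar file.

    papers: dict mapping a paper name to a list of (filename, tex_source)
    pairs. Each paper becomes one ``<name>.gz`` member of ``out_tar``:
    a gzipped .tex for single-file submissions, a gzipped tar otherwise.
    """
    out_dir = os.path.dirname(out_tar)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(out_tar, "w") as tar:
        for paper, files in papers.items():
            if len(files) == 1:
                # single-file submission: gzip the .tex directly
                payload = gzip.compress(files[0][1].encode("utf-8"))
            else:
                # multi-file submission: gzip a tar of all sources
                buf = io.BytesIO()
                with tarfile.open(fileobj=buf, mode="w") as inner:
                    for name, source in files:
                        data = source.encode("utf-8")
                        info = tarfile.TarInfo(name=name)
                        info.size = len(data)
                        inner.addfile(info, io.BytesIO(data))
                payload = gzip.compress(buf.getvalue())
            member = tarfile.TarInfo(name=f"{paper}.gz")
            member.size = len(payload)
            tar.addfile(member, io.BytesIO(payload))

# Example (hypothetical paper names, mirroring the tree above):
pack_arxiv_src(
    {
        "papername1": [("name.tex", r"\documentclass{article}...")],
        "papername2": [("name.tex", "..."), ("other_name.tex", "...")],
    },
    "data/src/my_arxiv_src.tar",
)
```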
You can then call `python run_clean.py --data_dir data/src --target_dir /dir/to/desired/output`. Also check out this part of the `arxiv_cleaner`
module: https://github.com/togethercomputer/RedPajama-Data/blob/d174968e57a0a515982766771d7b805a8b1bc83a/data_prep/arxiv/arxiv_cleaner.py#L131-L177
Let me know if this helps!
I downloaded the arXiv tex files myself, without running `scripts/arxiv-kickoff-download.sh`.
My data structure is
I want to preprocess my LaTeX data, so I run `bash scripts/arxiv-kickoff-cleaning.sh`, and `arxiv-kickoff-cleaning.sh` is the following:
`arxiv-kickoff-cleaning.sh` runs with no error, but the result files, `arxiv_1.jsonl` and `arxiv_2.jsonl`, have no content... What are DATA_DIR and TARGET_DIR? Is there any way to run this on LaTeX files?