togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet #64

Closed: shawn0wang closed this issue 1 month ago

shawn0wang commented 1 year ago

How can I do that? I didn't find that cc_net can process a WARC file into a WARC.wet file.

mauriceweber commented 1 year ago

You need to process the WARC files by running the CC html parser on them; this produces a .wet file, which you can then pass through the ccnet pipeline.
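For orientation, this is roughly what a record in the resulting .wet file looks like: one "conversion" record per HTML page, with the extracted plain text as the body. The sketch below builds a made-up record locally just to illustrate the shape; the URL, text, and lengths are placeholders, not output from a real crawl.

```shell
# A minimal stand-in for one WET "conversion" record (placeholder values).
wet_record='WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://en.wikipedia.org/wiki/Example
Content-Type: text/plain
Content-Length: 13

Example text.'

# A quick structural check: WET records are marked as type "conversion".
printf '%s\n' "$wet_record" | grep -q '^WARC-Type: conversion' && echo "looks like a WET record"
```

The ccnet pipeline then reads records of this general shape from the .wet file.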

You can run the following steps (also described here: https://groups.google.com/g/common-crawl/c/imv4hlLob4s/m/aDyrdMklEAAJ) to generate the WET files from WARC (note that you need Java 7 or 8 for this):

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn -f pom.xml install
cd -

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar ./target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
name_of_archive /path/to/warc/warcfile.warc.gz
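Once the job finishes, you can sanity-check the generated WET output by counting its "conversion" records before feeding it to ccnet. The snippet below is a sketch: the filename is a stand-in created on the spot for illustration, so substitute the real path of the .wet.gz file your run produced.

```shell
# Stand-in WET file (hypothetical name); replace with your real output path.
wet=warc_wikipedia.warc.wet.gz
printf 'WARC/1.0\nWARC-Type: conversion\nContent-Type: text/plain\n\nhello\n' | gzip > "$wet"

# Each extracted page shows up as one "conversion" record, so this count
# should match the number of pages the parser extracted.
gzip -dc "$wet" | grep -c '^WARC-Type: conversion'

rm -f "$wet"
```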