togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet #64

Closed: shawn0wang closed this issue 1 month ago

shawn0wang commented 1 year ago

How can I do that? I didn't find that cc_net can process a WARC file into a WARC.wet file.

mauriceweber commented 1 year ago

You need to process the WARC files by running the CC html parser on them; this produces a .wet file, which you can then pass through the ccnet pipeline.
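For orientation, this is roughly what a record in the resulting .wet file looks like: one "conversion" record per HTML page, with the extracted plain text as the body. The sketch below builds a made-up record locally just to illustrate the shape; the URL, text, and lengths are placeholders, not output from a real crawl.

```shell
# A minimal stand-in for one WET "conversion" record (placeholder values).
wet_record='WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://en.wikipedia.org/wiki/Example
Content-Type: text/plain
Content-Length: 13

Example text.'

# A quick structural check: WET records are marked as type "conversion".
printf '%s\n' "$wet_record" | grep -q '^WARC-Type: conversion' && echo "looks like a WET record"
```

The ccnet pipeline then reads records of this general shape from the .wet file.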

You can run the following steps (also described here: https://groups.google.com/g/common-crawl/c/imv4hlLob4s/m/aDyrdMklEAAJ) to generate the WET files from WARC (note that you need Java 7 or 8 for this):

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn -f pom.xml install
cd -

git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar ./target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
name_of_archive /path/to/warc/warcfile.warc.gz
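Once the job finishes, you can sanity-check the generated WET output by counting its "conversion" records before feeding it to ccnet. The snippet below is a sketch: the filename is a stand-in created on the spot for illustration, so substitute the real path of the .wet.gz file your run produced.

```shell
# Stand-in WET file (hypothetical name); replace with your real output path.
wet=warc_wikipedia.warc.wet.gz
printf 'WARC/1.0\nWARC-Type: conversion\nContent-Type: text/plain\n\nhello\n' | gzip > "$wet"

# Each extracted page shows up as one "conversion" record, so this count
# should match the number of pages the parser extracted.
gzip -dc "$wet" | grep -c '^WARC-Type: conversion'

rm -f "$wet"
```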