You need to process the WARC files by running the Common Crawl (CC) HTML parser on them; this produces a .wet file for each one, which you can then pass through the ccnet pipeline.
You can run the following steps (also described here https://groups.google.com/g/common-crawl/c/imv4hlLob4s/m/aDyrdMklEAAJ) to generate the WET files from WARC files (note that you need Java 7 or 8 for that):
git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn -f pom.xml install
cd -
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package
java -jar ./target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
name_of_archive /path/to/warc/warcfile.warc.gz
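If you have a whole directory of WARC files, the last command can be wrapped in a loop. This is only a rough sketch, not something shipped with ia-hadoop-tools: the `generate_wet` function and the `JAR`, `ARCHIVE_NAME` and `DRY_RUN` variables are names I made up for illustration, and it assumes you run it from inside the ia-hadoop-tools checkout after `mvn package`. By default it only prints the commands so you can check them first.

```shell
#!/bin/sh
# Sketch: run the WEATGenerator command shown above on every *.warc.gz file
# in a directory. JAR, ARCHIVE_NAME and DRY_RUN are hypothetical helper
# variables, not options of ia-hadoop-tools itself.

JAR="${JAR:-./target/ia-hadoop-tools-jar-with-dependencies.jar}"
ARCHIVE_NAME="${ARCHIVE_NAME:-name_of_archive}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

generate_wet() {
    warc_dir="$1"
    for warc in "$warc_dir"/*.warc.gz; do
        # Skip the unexpanded glob when the directory has no matching files.
        [ -e "$warc" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            echo java -jar "$JAR" WEATGenerator "$ARCHIVE_NAME" "$warc"
        else
            java -jar "$JAR" WEATGenerator "$ARCHIVE_NAME" "$warc"
        fi
    done
}
```

Call it as `generate_wet /path/to/warc` to preview the commands, then set `DRY_RUN=0` to actually run them (with Java 7 or 8 on your PATH).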
How can I do that? I didn't find a way for cc_net to convert WARC files to WET (.warc.wet) files.