ikreymer / cdx-index-client

A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
MIT License

Get a WARC archive with all files from a domain #3

Closed (dportabella closed 6 years ago)

dportabella commented 8 years ago

I'd like to download all pages from the www.ipc.com domain into a WARC archive file (or several files), so I do the following:

$ ./cdx-index-client.py -c CC-MAIN-2015-06 http://www.ipc.com/
$ cat www.ipc.com-0
com,ipc)/ 20150127054500 {"url": "http://www.ipc.com/", "digest": "2WIVV4MGIEL27MAOOREEEKCIATEK43GM", "length": "9953", "offset": "768421563", "filename": "crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz"}
[...]

$ wget https://commoncrawl.s3.amazonaws.com:/crawl-data/CC-MAIN-2015-06/segments/1422115861027.55/warc/CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ gunzip -k CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc.gz
$ cat CC-MAIN-20150124161101-00006-ip-10-180-212-252.ec2.internal.warc | tail -c +768421563 | head -c 9953 >segment1.warc

Here, I would expect to get some WARC entries for www.ipc.com, but I get a "random" chunk of the input file.

wumpus commented 8 years ago

I'd recommend also bringing up issues like this on the Common Crawl mailing list, where they'll be seen by a lot more people. In this case, I can answer your question: the offset is a byte offset into the compressed WARC (the .warc.gz file), not into the decompressed one. This is so you don't have to download the whole WARC to access just the one page.
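
For readers who land here with the same problem, the following is a minimal Python sketch of the point made above. It assumes each record in the .warc.gz is stored as its own gzip member, which is what makes an offset into the compressed file usable, and that the index "length" is likewise the compressed size of that member. The function names (parse_cdx_line, byte_range_header, read_record, fetch_record) are hypothetical helpers written for this illustration; they are not part of cdx-index-client.

```python
import gzip
import json
import urllib.request


def parse_cdx_line(line):
    """Split one index line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, payload = line.split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def byte_range_header(offset, length):
    """Build an HTTP Range value covering exactly one compressed record.

    Range end positions are inclusive, hence the "- 1".
    """
    return "bytes={}-{}".format(offset, offset + length - 1)


def read_record(fileobj, offset, length):
    """Read one record from an already downloaded .warc.gz.

    Seeks in the COMPRESSED file, then decompresses only that slice.
    """
    fileobj.seek(offset)
    return gzip.decompress(fileobj.read(length))


def fetch_record(warc_url, offset, length):
    """Fetch one record over HTTP without downloading the whole WARC.

    warc_url is the full URL of the .warc.gz file, supplied by the caller.
    """
    request = urllib.request.Request(
        warc_url, headers={"Range": byte_range_header(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())
```

With the index line from the question, the fields arrive as strings, so int(fields["offset"]) and int(fields["length"]) are needed before calling read_record on the downloaded .warc.gz (opened in binary mode) or fetch_record on its URL. The original pipeline failed because it ran gunzip first and then applied the offset to the decompressed data.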

dportabella commented 8 years ago

Thanks, I'll continue the discussion here: https://groups.google.com/forum/#!topic/common-crawl/0fYTJtFD6Fs