dportabella closed this issue 6 years ago
I'd also recommend bringing up issues like this on the Common Crawl mailing list, where they'll be seen by a lot more people. In this case, I can answer your question: the offset is a byte offset into the compressed WARC file, not into the decompressed data. This is so you don't have to download the whole WARC to access just the one page.
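To illustrate what an offset into the compressed WARC means in practice, here is a minimal Python sketch. It assumes the WARC is a sequence of concatenated gzip members, one per record, and that you already have an offset and length for the record from the index. The function names `range_header` and `read_record` are made up for this example and are not part of any Common Crawl tool.

```python
import gzip


def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header covering `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def read_record(compressed: bytes, offset: int, length: int) -> bytes:
    """Decompress the single gzip member found at `offset` in the compressed data.

    `compressed` stands in for the raw bytes of a .warc.gz file
    (an assumption for this sketch; in practice you would fetch only
    the needed byte range instead of holding the whole file).
    """
    member = compressed[offset:offset + length]
    return gzip.decompress(member)


# Small in-memory demo: two fake records, each compressed as its own gzip member.
record_one = b"WARC/1.0\r\nfake record one\r\n\r\n"
record_two = b"WARC/1.0\r\nfake record two\r\n\r\n"
member_one = gzip.compress(record_one)
member_two = gzip.compress(record_two)
fake_warc_gz = member_one + member_two

# The second record starts where the first compressed member ends.
second = read_record(fake_warc_gz, len(member_one), len(member_two))
print(second.decode("ascii"))
```

Applying the same offset to the decompressed data instead would land at an unrelated position, because compressed and decompressed byte positions do not line up.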
Thanks, I'll continue the discussion here: https://groups.google.com/forum/#!topic/common-crawl/0fYTJtFD6Fs
I'd like to download all pages from the www.ipc.com domain into a WARC archive file (or several files), so I do as follows:
Here, I would expect to get some WARC entries for www.ipc.com, but instead I get a "random" chunk of the input file.