iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Add a primer on WARC deduplication #17

Open anjackson opened 9 years ago

pirate commented 3 weeks ago

Does the dedup standard allow for deduping across multiple independent WARC files, or is it only for deduping within a single WARC?

Also, are there any considerations for optimizing filesystem-layer deduping across multiple WARC files? (Probably not, but I'd just like to confirm.) Is there a way to make sure uncompressed byte sequences start at rounded byte offsets within a WARC, so that block-level dedup via something like ZFS fast dedup could detect identical blocks at different locations in two different WARCs?

ato commented 3 weeks ago

Does the dedup standard allow for deduping across multiple independent WARC files

Yes, revisit records can refer to records in other WARC files. The common way they're used is that you run one crawl, producing an initial set of WARC files, and then you run a second crawl that produces a new set of WARC files with revisit records resolving against the original crawl. You then build an index keyed on URL and date with all of the records of both crawls together and use the index to locate the original record that a revisit refers to.
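The index-based resolution described above can be sketched roughly like this. This is a minimal illustration using plain tuples rather than a real CDX(J) index, and the URLs, filenames, and the `resolve` helper are all made up for the example:

```python
# Hypothetical index entries from two crawls: (url, date, record type,
# WARC filename, byte offset). A real deployment would use a CDX(J) index.
records = [
    ("http://example.com/", "20200101000000", "response", "crawl1.warc.gz", 0),
    ("http://example.com/", "20210101000000", "revisit",  "crawl2.warc.gz", 0),
]

# Build an index keyed on URL, covering both crawls together.
index = {}
for url, date, rtype, warc, offset in records:
    index.setdefault(url, []).append((date, rtype, warc, offset))

def resolve(url, revisit_date):
    """Locate the most recent non-revisit capture of url at or before
    revisit_date, i.e. the record the revisit refers to."""
    candidates = [e for e in index.get(url, [])
                  if e[1] != "revisit" and e[0] <= revisit_date]
    if not candidates:
        return None
    date, rtype, warc, offset = max(candidates)
    return (warc, offset)
```

Playback tools then use the returned (WARC file, offset) pair to fetch the original payload when serving the revisit.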

Also, are there any considerations for optimizing filesystem-layer deduping across multiple WARC files

Not that I'm aware of.

Is there a way to make sure uncompressed byte sequences start at rounded byte offsets within a WARC

I haven't heard of anyone doing this, and I would personally prefer revisit records, but theoretically, for uncompressed WARC files, you could try adding padding in a custom header field to align the start of the payload with a block boundary.
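A sketch of the padding calculation, assuming a hypothetical `X-Warc-Pad` field (not part of the WARC standard) and a 4 KiB dedup block size. The field's own bytes have to be counted when solving for the pad length:

```python
# Hypothetical padding field; nothing in the WARC spec defines this.
FIELD = b"X-Warc-Pad: "
CRLF = b"\r\n"
BLOCK = 4096  # assumed filesystem dedup block size

def pad_header(payload_offset, block_size=BLOCK):
    """Return a padding header field such that, once it is inserted,
    the payload begins on a block boundary. payload_offset is the
    absolute file offset where the payload would start without it.
    Returns b"" when the payload is already aligned."""
    if payload_offset % block_size == 0:
        return b""
    overhead = len(FIELD) + len(CRLF)           # bytes the field itself adds
    n = (-(payload_offset + overhead)) % block_size
    return FIELD + b"*" * n + CRLF
```

Whether downstream tools tolerate such an unknown header field would need checking; the WARC spec allows extension fields, but this particular one is invented here.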

For gzipped WARCs you could maybe add padding to the EXTRA or COMMENT fields in the gzip header. You'd probably also need to compress the payload as a gzip member separate from the WARC and HTTP headers, because differences in the headers would cause changes in the rest of the compressed stream. There might be some compatibility issues with tools that assume a WARC record doesn't span gzip members, though.
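For the EXTRA-field idea, a gzip member with header padding can be built by hand per RFC 1952 (the FEXTRA flag). This is only a sketch of the mechanism; the `PD` subfield ID is made up, and standard decompressors simply skip it:

```python
import zlib

def gzip_member_with_padding(payload, pad_len):
    """Build a standalone gzip member whose header carries pad_len bytes
    of zero padding in an EXTRA subfield (FEXTRA, RFC 1952)."""
    # EXTRA subfield: 2-byte ID + 2-byte little-endian length + data.
    # "PD" is an arbitrary, made-up subfield ID.
    subfield = b"PD" + pad_len.to_bytes(2, "little") + b"\x00" * pad_len
    header = (
        b"\x1f\x8b"                          # gzip magic
        + b"\x08"                            # CM: deflate
        + b"\x04"                            # FLG: FEXTRA set
        + b"\x00" * 4                        # MTIME: unset
        + b"\x00"                            # XFL
        + b"\xff"                            # OS: unknown
        + len(subfield).to_bytes(2, "little")  # XLEN
        + subfield
    )
    co = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw deflate, no wrapper
    body = co.compress(payload) + co.flush()
    trailer = (zlib.crc32(payload).to_bytes(4, "little")
               + (len(payload) & 0xFFFFFFFF).to_bytes(4, "little"))
    return header + body + trailer
```

Varying `pad_len` shifts where the compressed body starts in the file, though note the deflate stream itself stays compressed, so this alone doesn't make the payload bytes block-identical across WARCs; it would only help alongside something like gzip with stored (uncompressed) blocks.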

pirate commented 3 weeks ago

Awesome thanks for the info!