iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself #42

Closed saraaubry closed 9 years ago

saraaubry commented 9 years ago

In the current implementation of the WAT extractor, the WARC-Filename in the WAT warcinfo record corresponds to the filename of the original (W)ARC record. According to the WARC ISO standard, it should be the WAT filename itself.

Current:

    WARC/1.0
    WARC-Type: warcinfo
    WARC-Date: 2015-02-18T10:24:54Z
    WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.gz
    WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
    Content-Type: application/warc-fields
    Content-Length: 73

Target:

    WARC/1.0
    WARC-Type: warcinfo
    WARC-Date: 2015-02-18T10:24:54Z
    WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.wat.gz
    WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
    Content-Type: application/warc-fields
    Content-Length: 73

Implementation:

    java extractor.jar -wat fichierA.warc.gz
    --> will go to standard output
    WARC-Filename:
        fichierA.warc.gz => fichierA.warc.wat.gz
        fichierA.arc.gz  => fichierA.arc.wat.gz
        fichierA.warc    => fichierA.warc.wat
        fichierA.arc     => fichierA.arc.wat

    java extractor.jar -wat fichierA.warc.gz fichierB.wat.warc.gz
    --> will go to file fichierB
    output WARC-Filename: fichierB.wat.warc.gz
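
For illustration, the naming rules above could be sketched roughly as follows. This is only a minimal sketch, not the extractor's actual code: the class `WatFilenames` and its methods are hypothetical names, and stripping any directory part from the paths is an assumption on my side, not something stated above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, NOT part of webarchive-commons: it only illustrates
// the filename mapping proposed in this issue.
final class WatFilenames {

    private WatFilenames() {}

    // Derives the WAT filename from the input (W)ARC filename:
    // ".wat" is inserted before a trailing ".gz", otherwise appended.
    static String deriveFromInput(String inputName) {
        String gz = ".gz";
        if (inputName.endsWith(gz)) {
            String stem = inputName.substring(0, inputName.length() - gz.length());
            return stem + ".wat" + gz;
        }
        return inputName + ".wat";
    }

    // Chooses the WARC-Filename value for the WAT warcinfo record.
    // outputPath == null stands for "write to standard output".
    static String warcFilenameFor(String inputPath, String outputPath) {
        if (outputPath != null) {
            // An explicit output file was given: use its own name.
            return baseName(outputPath);
        }
        // No output file: derive the name from the input file.
        return deriveFromInput(baseName(inputPath));
    }

    // Assumption: only the last path element is recorded, without directories.
    private static String baseName(String path) {
        Path name = Paths.get(path).getFileName();
        return name == null ? path : name.toString();
    }
}
```

With this sketch, `WatFilenames.warcFilenameFor("fichierA.warc.gz", null)` covers the standard-output case, and passing `"fichierB.wat.warc.gz"` as the second argument covers the case where an output file is named on the command line.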