iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

WAT extractor: do not fail on missing WARC-Filename in warcinfo record #88

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

(see commoncrawl/ia-web-commons#23)

The WAT resource extractor fails on WARC files that contain a "warcinfo" record with no "WARC-Filename" header (e.g. CC-NEWS-20160827132735-00002.warc.gz):

$> java -cp ... org.archive.extract.ResourceExtractor -wat CC-NEWS-20160827132735-00002.warc.gz
Exception in thread "main" java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
        at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:136)
        at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:154)
        at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:74)
        at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:139)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:62)

So the simplest solution would be to still extract a metadata record for a warcinfo record that lacks WARC-Filename, and to write that metadata record without a WARC-Target-URI header.
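
A rough sketch of that idea, not the actual WATExtractorOutput code: it assumes the WAT metadata record for a warcinfo record takes its WARC-Target-URI from WARC-Filename, and it treats a missing value as "omit the header" instead of throwing. The class `WatHeaderSketch` and the method `metadataHeaders` are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WatHeaderSketch {

    /**
     * Builds the headers of the WAT metadata record that describes one WARC record.
     * Hypothetical helper: the real extractor works on a JSON envelope, not a Map.
     */
    static Map<String, String> metadataHeaders(Map<String, String> warcHeaders) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("WARC-Type", "metadata");

        // Assumption: for warcinfo records the target is the WARC file name,
        // for all other records it is the record's own WARC-Target-URI.
        String type = warcHeaders.get("WARC-Type");
        String target = "warcinfo".equals(type)
                ? warcHeaders.get("WARC-Filename")
                : warcHeaders.get("WARC-Target-URI");

        // Instead of failing with an IOException, leave the header out when absent.
        if (target != null) {
            out.put("WARC-Target-URI", target);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> warcinfo = new LinkedHashMap<>();
        warcinfo.put("WARC-Type", "warcinfo");
        // No WARC-Filename: the metadata record has no WARC-Target-URI.
        System.out.println(metadataHeaders(warcinfo));

        // With WARC-Filename (placeholder value): it becomes the WARC-Target-URI.
        warcinfo.put("WARC-Filename", "example.warc.gz");
        System.out.println(metadataHeaders(warcinfo));
    }
}
```

With this approach a WARC file without WARC-Filename in its warcinfo record still yields a complete WAT file; only the one optional header is missing from the corresponding metadata record.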