The WAT resource extractor fails on WARC files that contain a "warcinfo" record with no "WARC-Filename" header (e.g. CC-NEWS-20160827132735-00002.warc.gz):
$> java -cp ... org.archive.extract.ResourceExtractor -wat CC-NEWS-20160827132735-00002.warc.gz
Exception in thread "main" java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:136)
at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:154)
at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:74)
at org.archive.extract.ResourceExtractor.run(ResourceExtractor.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.archive.extract.ResourceExtractor.main(ResourceExtractor.java:62)
(see commoncrawl/ia-web-commons#23)
So the simplest solution would be to still write the metadata record for a warcinfo record that lacks WARC-Filename, but without a WARC-Target-URI header.
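A minimal sketch of that idea, not the actual patch. It assumes the metadata record's WARC-Target-URI is normally taken from the warcinfo's WARC-Filename. The class `WatMetadataHeaderSketch` and the methods `optionalHeader` and `metadataHeaders` are hypothetical names used only for illustration; they are not part of ia-web-commons, and a plain `Map` stands in for the real header structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: a plain Map stands in for the parsed WARC headers.
class WatMetadataHeaderSketch {

    // Lenient lookup: returns an empty Optional instead of throwing
    // an IOException when the header is missing or empty.
    static Optional<String> optionalHeader(Map<String, String> headers, String name) {
        String value = headers.get(name);
        return (value == null || value.isEmpty()) ? Optional.empty() : Optional.of(value);
    }

    // Builds the headers of the metadata record for a warcinfo record.
    // WARC-Target-URI is only added when WARC-Filename is present.
    static Map<String, String> metadataHeaders(Map<String, String> warcinfoHeaders) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("WARC-Type", "metadata");
        optionalHeader(warcinfoHeaders, "WARC-Filename")
                .ifPresent(filename -> out.put("WARC-Target-URI", filename));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> noFilename = new LinkedHashMap<>();
        noFilename.put("WARC-Type", "warcinfo");
        // No exception is thrown; the target URI is simply left out.
        System.out.println(metadataHeaders(noFilename));
    }
}
```

With this approach the extractor keeps producing a record for every warcinfo, and consumers can tell from the missing WARC-Target-URI that the input had no WARC-Filename.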