willp-bl closed 10 years ago
I developed this because some of our FormatProfiler runs have been failing near the end of a long run due to problems with a single input file. An alternative would be to handle broken files in the RecordReader in warc-discovery, but that would not necessarily highlight that an input file has a problem.
Some of the WARC files we have for testing are zero length, or are not valid gzip files. This map-reduce program quickly checks all .gz files for such problems; the status of each file is shown in the output from the reducer.
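For illustration, the per-file check could look roughly like the sketch below. This is not the code in this PR (which runs as a map-reduce job); it is a standalone Python approximation, and the function name `check_gz` and the status strings are hypothetical. It treats a file as bad if it is zero length or if decompressing it to the end raises an error.

```python
import gzip
import os
import zlib


def check_gz(path, chunk_size=1 << 20):
    """Return a short status string for one gzip file (hypothetical helper)."""
    # Zero-length files are reported separately from corrupt ones.
    if os.path.getsize(path) == 0:
        return "EMPTY"
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so that truncation and CRC errors surface.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # Bad magic bytes, truncated streams and corrupt deflate data end up here.
        return "INVALID: " + type(exc).__name__
    return "OK"
```

Reading the whole stream is what makes the check meaningful: a file with a valid header but a truncated body only fails once the decompressor reaches the missing data.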
It would be possible to chain this map-reduce to run immediately before the format-profiler.
Approximate runtime on our cluster is about 5 minutes for roughly 15k WARC files totaling 1 TB.