openpreserve / nanite

Nanite - a friendly swarm of format-identifying robots.
openplanets.github.io/nanite/

Map-reduce program to check whether gz files can be opened successfully #8

Closed willp-bl closed 10 years ago

willp-bl commented 10 years ago

Some of the WARC files we have for testing are zero length, or may not be valid gz files. This map-reduce program quickly checks all gz files for such issues; the status is shown in the output from the reducer.
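To illustrate the kind of check described above, here is a minimal standalone sketch in Python. It is not the code in this pull request (which is a Hadoop job); the function names `check_gz` and `summarize` and the status labels `OK`, `EMPTY`, and `INVALID` are hypothetical, chosen only for the example. The "map" step tries to decompress one file end to end, and the "reduce" step counts files per status.

```python
import gzip
import os
import zlib
from collections import Counter


def check_gz(path, chunk_size=1 << 20):
    """Map step (hypothetical): return (status, path) for a single file."""
    # Zero-length files are reported separately from corrupt ones.
    if os.path.getsize(path) == 0:
        return ("EMPTY", path)
    try:
        with gzip.open(path, "rb") as stream:
            # Read through the whole stream so truncation or corruption
            # anywhere in the file is detected, not just a bad header.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError; EOFError signals truncation.
        return ("INVALID", path)
    return ("OK", path)


def summarize(results):
    """Reduce step (hypothetical): count files per status."""
    return Counter(status for status, _ in results)
```

Calling `summarize(check_gz(p) for p in paths)` over a list of file paths gives a per-status count, and the `INVALID` and `EMPTY` entries identify the inputs that would need attention before a longer job runs.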

It would be possible to chain this map-reduce to run immediately before the format-profiler.

Approximate runtime for this program is ~5 minutes for ~15k WARC files totaling 1 TB on our cluster.

willp-bl commented 10 years ago

I developed this because some of our FormatProfiler runs have been failing near the end of a long runtime due to problems with an input file. An alternative would be to handle broken files in the RecordReader in warc-discovery, but that may not necessarily highlight that an input file has an issue.