Closed sebastian-nagel closed 3 years ago
Fix for the infinite loop released as 0.16.5.
I think the invalid ARC trailer warning on 1266352769711_14.arc.gz is correct. The length field for the filedesc:// record seems to be off by one, whereas for the subsequent records it's correct. Admittedly, one could argue that the spec's definition of the length field in the version block as "the rest of the version block" is vague and could be interpreted to include the first newline or the trailer newline, even though this would be inconsistent with the response records, where the length is clearly defined to exclude these.
The ARCs in our collection, both those generated by Heritrix and others supplied by IA, are all consistent with jwarc's definition of the length.
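To make the length interpretation above concrete, here is a minimal Python sketch (not jwarc's actual code, which is Java). It assumes the length counts only the bytes between the newline ending the header line and the trailer newline. The function name `check_arc_record`, the helper `make_record`, and the sample header values are hypothetical, and the "one too large" case is just one way a length can be off by one; I have not checked which direction the Common Crawl file is off.

```python
def check_arc_record(data: bytes, offset: int = 0):
    """Return (body, trailer_ok) for the ARC v1 record starting at offset.

    Assumption: the declared length excludes both the newline that ends
    the header line and the single newline that follows the body.
    """
    header_end = data.index(b"\n", offset)            # end of the header line
    header = data[offset:header_end].decode("ascii")
    declared = int(header.rsplit(" ", 1)[1])          # last field is the length
    body_start = header_end + 1                       # skip the header newline
    body_end = body_start + declared
    body = data[body_start:body_end]
    trailer_ok = data[body_end:body_end + 1] == b"\n" # trailer must be a newline
    return body, trailer_ok


# Rest of the version block of a filedesc:// record (ARC v1 layout).
VERSION_BLOCK_REST = (
    b"1 0 InternetArchive\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)


def make_record(length: int) -> bytes:
    """Build a hypothetical filedesc:// record declaring the given length."""
    header = b"filedesc://example.arc 0.0.0.0 20100216000000 text/plain %d\n" % length
    return header + VERSION_BLOCK_REST + b"\n"


good = make_record(len(VERSION_BLOCK_REST))       # length excludes both newlines
bad = make_record(len(VERSION_BLOCK_REST) + 1)    # length also counts the trailer

print(check_arc_record(good)[1])
print(check_arc_record(bad)[1])
```

With the off-by-one length the trailer newline is swallowed into the body, so the byte where the trailer is expected is missing or belongs to the next record, which is the kind of mismatch an "invalid ARC trailer" warning would report.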
On certain ARC files the parser may run into an infinite loop. So far, I've found the following ARC files which reproducibly cause the hang-up when running the "validate" tool:
- IAH-20080430204825-00000-blackbook-truncated.arc: part of ukwa/webarchive-test-suite and also used by jwat as a test resource. Note: when parsing the gzipped variant (also part of the test suite), the parser complains about an "invalid ARC trailer". The stack during the hang-up:
- the gzipped ARC 1266352769711_14.arc.gz (Common Crawl 2010):