iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

ARC parser infinite loop reading body #62

Closed sebastian-nagel closed 2 years ago

sebastian-nagel commented 2 years ago

On certain ARC files, the parser may run into an infinite loop. So far, I've found the following ARC files that reproducibly cause the hang when running the "validate" tool:

ato commented 2 years ago

Fix for the infinite loop released as 0.16.5.

I think the invalid ARC trailer warning on 1266352769711_14.arc.gz is correct. The length field for the filedesc:// record seems to be off by one, whereas for the subsequent records it's correct. That said, I suppose an argument could be made that the spec's definition of the length field in the version block as "the rest of the version block" is vague and could be interpreted to include the first newline or the trailer newline, even though this is inconsistent with the response records, where it's clearly defined to exclude these.

The ARCs in our collection, both those generated by Heritrix and others supplied by IA, are all consistent with jwarc's definition of the length.
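
To illustrate why an off-by-one length shows up as a trailer warning, here is a minimal sketch. It is not jwarc's actual code: the class and method names are hypothetical, and it assumes a simplified record layout of one header line whose last field is the length, then the body, then a single newline trailer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from jwarc.
class ArcTrailerCheck {
    /**
     * Returns true if the byte immediately after the declared body
     * is the '\n' trailer. Assumes the length excludes both the
     * newline ending the header line and the trailer newline.
     */
    static boolean hasValidTrailer(byte[] record) {
        // Locate the newline that ends the header line.
        int headerEnd = 0;
        while (headerEnd < record.length && record[headerEnd] != '\n') {
            headerEnd++;
        }
        if (headerEnd == record.length) {
            return false; // no header line terminator found
        }
        String header = new String(record, 0, headerEnd, StandardCharsets.US_ASCII);
        String[] fields = header.split(" ");
        long length = Long.parseLong(fields[fields.length - 1]);

        // The body starts right after the header newline.
        long trailerPos = headerEnd + 1 + length;
        return trailerPos < record.length && record[(int) trailerPos] == '\n';
    }

    public static void main(String[] args) {
        String header = "http://example.org/ 0.0.0.0 20100101000000 text/plain ";
        byte[] ok = (header + "5\nhello\n").getBytes(StandardCharsets.US_ASCII);
        byte[] offByOne = (header + "6\nhello\n").getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasValidTrailer(ok));       // true
        System.out.println(hasValidTrailer(offByOne)); // false
    }
}
```

If a writer counts the trailer newline as part of the length, a reader using the definition above looks one byte past the real trailer and reports it as invalid, which matches the kind of warning described here.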