ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Fixed JSON property name to fix length always being zero #71

Closed: KarlXerri closed this 3 years ago

KarlXerri commented 3 years ago

This issue only affects crawls using the OutbackCDXClient bean.

The ExtendedWARCWriterProcessor in ukwa-heritrix sets the metadata property warcFileRecordLength, but OutbackCDXClient attempts to read it as warcRecordLength.

This results in all CDX lines containing a zero-value length field.

This fix changes OutbackCDXClient to instead read warcFileRecordLength, resulting in correct CDX length values.
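To illustrate the mismatch, here is a minimal sketch. It is not the actual ukwa-heritrix code: it assumes the per-record metadata behaves like a plain map and that a missing property falls back to zero. The class name, the readLength helper and the sample value are all hypothetical; only the two property names come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class CdxLengthKeyDemo {
    // Property name set by the writer side, as described above.
    static final String WRITTEN_KEY = "warcFileRecordLength";
    // Property name the client looked up before the fix.
    static final String OLD_READ_KEY = "warcRecordLength";

    // Hypothetical helper: returns the stored length, or 0 when the key is
    // absent (the zero fallback is an assumption made for this sketch).
    static long readLength(Map<String, Object> metadata, String key) {
        Object value = metadata.get(key);
        return (value instanceof Number) ? ((Number) value).longValue() : 0L;
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put(WRITTEN_KEY, 1234L); // illustrative record length

        // Looking up the old name finds nothing, so the length comes back as 0.
        System.out.println("before fix: " + readLength(metadata, OLD_READ_KEY));
        // Looking up the name that was actually written returns the real value.
        System.out.println("after fix:  " + readLength(metadata, WRITTEN_KEY));
    }
}
```

Sharing a single constant for the property name between the writer and the reader would be one way to keep the two sides from drifting apart again.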

Side note: If you use OutbackCDXClient, you must also use ExtendedWARCWriterProcessor; otherwise there will be no warcFileRecordLength property to read.

anjackson commented 3 years ago

Thanks very much for this. Somehow managed to miss this when renaming the field to make it consistent with the existing fields.