ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Fixed toCDXLine digest field #68

Closed KarlXerri closed 3 years ago

KarlXerri commented 3 years ago

Fixes an issue where toCDXLine wrote the digest with its scheme prefix instead of the raw SHA-1 value. It now uses getContentDigestString instead of getContentDigestSchemeString.

With getContentDigestSchemeString, the module prepared the CDX line with a digest such as sha1:luwq7wuizbu3bgfqgpkfmplfx5sefgsy rather than the correct form, LUWQ7WUIZBU3BGFQGPKFMPLFX5SEFGSY.
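To make the difference between the two forms concrete, here is a minimal, self-contained sketch. It is not the code in this PR (the actual fix is simply the switch to getContentDigestString). The class DigestSketch and the helper rawDigest are hypothetical names used only for illustration, and the upper-casing step is an assumption based on the example values above.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of Heritrix or ukwa-heritrix.
final class DigestSketch {

    // Strips an optional "scheme:" prefix (e.g. "sha1:") and upper-cases the
    // remainder, giving the raw form shown in the PR description.
    // Assumption: the raw digest is the text after the first colon.
    static String rawDigest(String digest) {
        if (digest == null) {
            return null;
        }
        int colon = digest.indexOf(':');
        String raw = (colon >= 0) ? digest.substring(colon + 1) : digest;
        return raw.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String prefixed = "sha1:luwq7wuizbu3bgfqgpkfmplfx5sefgsy";
        String expected = "LUWQ7WUIZBU3BGFQGPKFMPLFX5SEFGSY";

        // The scheme-prefixed form never equals the raw form directly.
        System.out.println(prefixed.equals(expected));            // false
        // Once the prefix is removed, the two values agree.
        System.out.println(rawDigest(prefixed).equals(expected)); // true
    }
}
```

A plain string comparison between the prefixed and raw values fails, which is why the digest has to be written in the raw form in the first place.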

Prior to outbackcdx-0.11.0, the resulting digest would be parsed incorrectly and stored in a form such as SHALUWQ7WUIZBU3BGFQGPKFMPLFX5S-----.

In any case, CDX records read back out of Outback would never produce a positive match with the corresponding Heritrix digest.

anjackson commented 3 years ago

Great, thank you! Utterly mystified as to why it used to work.