ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Mark truncated files #212

Open tokee opened 4 years ago

tokee commented 4 years ago

Presumably due to a misconfiguration of Heritrix at the Royal Danish Library, we have a non-trivial number of truncated records in some of our WARCs. Some of these are silently truncated, i.e. there is no explicit marking in the WARC, just missing bytes.

webarchive-discovery should have an option to skip indexing such records and/or to mark them as truncated in the index. We might also consider file-type-specific handling, since movie files tend to tolerate truncation well, whereas compressed formats such as Word files become unusable.

anjackson commented 4 years ago

For reference, the WARC-Truncated header is the one we need to look out for.
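
A minimal sketch of that check, assuming the WARC record header block is already available as a string. The function name `warc_truncated_reason` is hypothetical and not part of webarchive-discovery; the sketch also ignores folded (continuation) header lines for simplicity.

```python
from typing import Optional


def warc_truncated_reason(header_block: str) -> Optional[str]:
    """Return the value of the WARC-Truncated header, or None if absent.

    Hypothetical helper: header names are matched case-insensitively,
    and folded continuation lines are not handled.
    """
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        # Lines without a colon (e.g. the "WARC/1.0" version line) are skipped.
        if sep and name.strip().lower() == "warc-truncated":
            return value.strip()
    return None
```

A record explicitly marked by the crawler would yield a value such as `length` or `time`; the silently truncated records discussed in this issue would yield `None`.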

tokee commented 4 years ago

Ideally yes. But the ones we have are faulty, i.e. without that header.

anjackson commented 4 years ago

Oh dear. Is there another condition we can look for? Maybe len(bytes) < content_length if not chunked?
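
To make the proposed condition concrete, here is a rough sketch, assuming the HTTP response headers are available as a dict and the payload is the body exactly as stored in the WARC (before any Content-Encoding is undone). The function name `looks_silently_truncated` is hypothetical, and this is a heuristic rather than a definitive test: responses that legitimately carry no body (e.g. to HEAD requests) would need separate handling.

```python
def looks_silently_truncated(http_headers: dict, payload: bytes) -> bool:
    """Heuristic: fewer stored bytes than Content-Length declares.

    Hypothetical helper. Returns False whenever the comparison is not
    meaningful (chunked transfer, missing or malformed Content-Length).
    """
    headers = {k.lower(): v for k, v in http_headers.items()}

    # For chunked bodies Content-Length does not describe the payload.
    if "chunked" in headers.get("transfer-encoding", "").lower():
        return False

    declared = headers.get("content-length")
    if declared is None:
        return False

    try:
        expected = int(declared.strip())
    except ValueError:
        return False  # malformed header: no basis for a decision

    return len(payload) < expected
```

An indexer could use the result either to skip the record or to set a truncation flag on the indexed document, matching the two options suggested at the top of this issue.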

tokee commented 4 years ago

Coding would be 75% easier if we did not have to handle all the things that can go wrong.