Open tokee opened 4 years ago
For reference, the WARC-Truncated header is the one we need to look out for.
Ideally yes. But the ones we have are faulty, i.e. without that header.
Oh dear. Is there another condition we can look for? Maybe len(bytes) < content_length if not chunked?
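A minimal sketch of that check in Python, just to make the condition concrete. The function name `looks_silently_truncated` and the plain-dict header representation are made up for illustration and are not part of webarchive-discovery. It assumes the payload is the raw HTTP body as stored in the record, i.e. before any Content-Encoding is undone, since that is what Content-Length counts.

```python
def looks_silently_truncated(http_headers: dict, payload: bytes) -> bool:
    """Heuristic: True if the stored body is shorter than Content-Length declares.

    http_headers: HTTP response headers of the record (hypothetical plain dict).
    payload: raw body bytes as stored, before any content decoding.
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in http_headers.items()}

    # With chunked transfer encoding, Content-Length says nothing useful.
    if "chunked" in headers.get("transfer-encoding", "").lower():
        return False

    declared = headers.get("content-length")
    if declared is None:
        return False  # nothing to compare against

    try:
        expected = int(declared.strip())
    except ValueError:
        return False  # malformed header, cannot decide

    return len(payload) < expected
```

This only catches records where the server declared a length. Chunked responses and responses without Content-Length would need some other signal.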
Coding would be 75% easier if we did not have to handle all the things that can go wrong.
Presumably due to a misconfiguration of Heritrix at the Royal Danish Library, we have a non-trivial number of truncated records in some of our WARCs. Some of these are silently truncated, i.e. there is no explicit marking in the WARC, just missing bytes.
webarchive-discovery should have an option of not indexing such records and/or it should mark them as truncated in the index. We might also consider file-specific handling, since movie files tend to handle truncation well, whereas compressed formats such as Word files become unusable.