The Document Harvester has something of a backlog of document records that will never process successfully. Because we have to keep retrying each record until the item is available and playback is working, any 'broken' records end up being re-processed over and over.
For example, quite a lot of Welsh GOV records appear to be empty PDF files, and these are not handled well. Similarly, because we draw metadata from the live web, changes to the original site can cause issues.
We need to consider:
Analyzing 'stuck' documents to check what the problems are, and extending the docharvester tool to detect these cases and reject the records.
Limiting the impact of stuck records, by e.g. adding counters for 'attempts to check CDX' and 'attempts to submit to W3ACT', and maybe adding a 'last processed date' so we only retry individual records e.g. once per day.
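As a starting point for discussion, the two ideas above could be sketched roughly as follows. This is only an illustration, not the docharvester's actual code: the `DocumentRecord` fields, the function names, the attempt limits and the one-day retry interval are all hypothetical, and "empty PDF" is assumed here to mean a payload with no non-whitespace bytes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical limits; the real values would need to be agreed.
MAX_CDX_ATTEMPTS = 30
MAX_W3ACT_ATTEMPTS = 10
RETRY_INTERVAL = timedelta(days=1)


@dataclass
class DocumentRecord:
    """Hypothetical shape of a document record with retry bookkeeping."""
    url: str
    cdx_attempts: int = 0
    w3act_attempts: int = 0
    last_processed: Optional[datetime] = None
    rejected: bool = False
    reject_reason: Optional[str] = None


def is_empty_payload(payload: bytes) -> bool:
    """Assumption: an 'empty PDF' is a payload with no non-whitespace bytes."""
    return len(payload.strip()) == 0


def reject(record: DocumentRecord, reason: str) -> None:
    """Mark a record as permanently rejected so it is never retried."""
    record.rejected = True
    record.reject_reason = reason


def should_process(record: DocumentRecord, now: datetime) -> bool:
    """Skip rejected records and anything already tried within RETRY_INTERVAL."""
    if record.rejected:
        return False
    if record.last_processed is None:
        return True
    return now - record.last_processed >= RETRY_INTERVAL


def note_failed_attempt(record: DocumentRecord, stage: str, now: datetime) -> None:
    """Count a failed attempt for a stage ('cdx' or 'w3act'); reject at the limit."""
    record.last_processed = now
    if stage == "cdx":
        record.cdx_attempts += 1
        if record.cdx_attempts >= MAX_CDX_ATTEMPTS:
            reject(record, "not found in CDX after maximum attempts")
    elif stage == "w3act":
        record.w3act_attempts += 1
        if record.w3act_attempts >= MAX_W3ACT_ATTEMPTS:
            reject(record, "W3ACT submission failed after maximum attempts")
    else:
        raise ValueError(f"unknown stage: {stage}")
```

In a processing loop, each record would first be passed through `should_process`, known-bad cases such as empty payloads would be passed to `reject` straight away, and every failed CDX check or W3ACT submission would go through `note_failed_attempt`. Keeping a `reject_reason` on the record would also help with the analysis of what is actually going wrong.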