ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Make Document Harvester handle 'broken' items more efficiently #95

Closed anjackson closed 1 year ago

anjackson commented 2 years ago

The Document Harvester has a backlog of document records that will never work. Because we have to keep retrying until items are available and playback is working, we end up re-processing these 'broken' records over and over.

For example, there seem to be quite a lot of Welsh GOV records that are empty PDF files. These are not well handled. Similarly, because we draw metadata from the live web, changes to the original site can cause issues.
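One way to stop the endless re-processing could be a retry budget plus an early exit for records that can never succeed. The following is only a minimal sketch under stated assumptions, not how the Document Harvester actually works: `DocumentRecord`, `record_outcome`, the status names, `MAX_ATTEMPTS` and `BASE_DELAY` are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical tuning values, not taken from the real harvester.
MAX_ATTEMPTS = 5
BASE_DELAY = timedelta(hours=1)


@dataclass
class DocumentRecord:
    """Hypothetical stand-in for a harvested document record."""
    url: str
    attempts: int = 0
    status: str = "NEW"  # one of NEW, RETRY, BROKEN, DONE
    next_attempt: Optional[datetime] = None


def record_outcome(
    record: DocumentRecord,
    now: datetime,
    payload_length: Optional[int],
    playback_ok: bool,
) -> DocumentRecord:
    """Update a record after one processing attempt.

    payload_length is None when the item is not available yet,
    and 0 when the archived file (e.g. a PDF) is empty.
    """
    record.attempts += 1
    record.next_attempt = None
    if payload_length == 0:
        # An empty file will never become valid, so stop retrying at once.
        record.status = "BROKEN"
    elif payload_length is not None and playback_ok:
        record.status = "DONE"
    elif record.attempts >= MAX_ATTEMPTS:
        # Retry budget exhausted: park the record instead of looping forever.
        record.status = "BROKEN"
    else:
        # Not available or not playable yet: retry with exponential backoff.
        record.status = "RETRY"
        record.next_attempt = now + BASE_DELAY * 2 ** (record.attempts - 1)
    return record
```

With this shape, an empty PDF is marked `BROKEN` on the first attempt, while an item that is simply not yet available is retried at growing intervals until the budget runs out. Records marked `BROKEN` could then be reviewed by hand rather than re-processed on every run.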

We need to consider:

anjackson commented 1 year ago

The refactored DDHAPT more or less sidesteps all this, so I'm calling it closed.