[ingest] A second ingest after cancel fails under certain conditions

Describe the bug Under certain conditions, the second ingestion of a file fails and causes an eternal loop. The same file has started being ingested previously (adding a run), but has then been cancelled by the user (deleting the run).

Some more details: The eternal loop of failures to ingest the same file over and over again may block the ingest service and the whole pipeline (if the number of such simultanous failed files exceeds the pre-fetch count x no of replicas of the ingest service in the deployment.)

The ingest service errors out on a DB query to get current checksums of the file (DB method: db.GetFileInfo, checkSums are NULL), at a point in the processing were the message is nacked, so that it can be retried later, causing an eternal loop of failures for this file.

Steps to reproduce

The pre-condition for hitting this bug is:

the cancel message must arrive very short after the ingest message of the same file
that will cause the verify service as part of the first ingestion (correlation-id trail of messages) will already see the file flagged as disabled in the DB, and abort the ingestion process at this point

A straightforward way to make sure the ingest and cancel message for the same file are arriving very synchronous to the ingest service, is to "take down" the intercept service while both messages are being posted "from the outside" (either from the submitter portal or other means) and then bring the intercept service back up.

We had the default prefetch count of 2 and a single ingest service instance running when observing this.

Expected behavior

[ ] Proper ingestion of the file the second time after a cancel message, not depending on a minimum time between ingest, cancel, ingest messages for the same file
[ ] Tests verifying the fix are added

Additional context

A quick and dirty fix was to modify the failing sql/DB query causing the ingest service to fail in the second ingestion. The DB query returns a row of null values, which is seen as an error and not passed correctly back to the go code. By using COALESCE to replace NULL with "" in the SQL query, the DB result is returned without errors and the ingest code (and the rest of the pipeline) seem to handle the situation of the new "" result ok'ish (still fails, nack message, but next attempt goes through as file status in DB has now changed. A later service in this second ingestion attempt still complains about not being able to insert a checksum in the DB table since it's already there, but should be a harmless warning).

This issue may fit into a bigger picture of how to handle cancel messages in a robust manner.

neicnordic / sensitive-data-archive

[ingest] A second ingest after cancel fails under certain conditions #833