neicnordic / sensitive-data-archive

https://neic-sda.readthedocs.io
3 stars 6 forks source link

[ingest] A second ingest after cancel fails under certain conditions #833

Open kjellp opened 2 months ago

kjellp commented 2 months ago

Describe the bug Under certain conditions, the second ingestion of a file fails and causes an eternal loop. The same file has started being ingested previously (adding a run), but has then been cancelled by the user (deleting the run).

Some more details: The eternal loop of failures to ingest the same file over and over again may block the ingest service and the whole pipeline (if the number of such simultanous failed files exceeds the pre-fetch count x no of replicas of the ingest service in the deployment.)

The ingest service errors out on a DB query to get current checksums of the file (DB method: db.GetFileInfo, checkSums are NULL), at a point in the processing were the message is nacked, so that it can be retried later, causing an eternal loop of failures for this file.

Steps to reproduce

The pre-condition for hitting this bug is:

A straightforward way to make sure the ingest and cancel message for the same file are arriving very synchronous to the ingest service, is to "take down" the intercept service while both messages are being posted "from the outside" (either from the submitter portal or other means) and then bring the intercept service back up.

We had the default prefetch count of 2 and a single ingest service instance running when observing this.

Expected behavior

Additional context

A quick and dirty fix was to modify the failing sql/DB query causing the ingest service to fail in the second ingestion. The DB query returns a row of null values, which is seen as an error and not passed correctly back to the go code. By using COALESCE to replace NULL with "" in the SQL query, the DB result is returned without errors and the ingest code (and the rest of the pipeline) seem to handle the situation of the new "" result ok'ish (still fails, nack message, but next attempt goes through as file status in DB has now changed. A later service in this second ingestion attempt still complains about not being able to insert a checksum in the DB table since it's already there, but should be a harmless warning).

This issue may fit into a bigger picture of how to handle cancel messages in a robust manner.

aaperis commented 2 months ago

After discussions within the SE team, a short-term fix would be the following:

P.S. What version of the sda stack was running when the above was observed?