Open corneliusroemer opened 6 months ago
This piece of code in the e2e reproduces the issue:
await Promise.all([
fakeProcessingPipeline.query(absurdlyManySoThatAllSequencesAreInProcessing),
fakeProcessingPipeline.query(absurdlyManySoThatAllSequencesAreInProcessing),
fakeProcessingPipeline.query(absurdlyManySoThatAllSequencesAreInProcessing),
fakeProcessingPipeline.query(absurdlyManySoThatAllSequencesAreInProcessing),
fakeProcessingPipeline.query(absurdlyManySoThatAllSequencesAreInProcessing),
]);
The problem is that many preprocessing pipelines pull data in parallel. Under normal circumstances this should not happen in a real deployment with real preprocessing pipelines. But I think it would still be good to implement something that prevents this error. It comes from this insert statement: https://github.com/loculus-project/loculus/blob/2589319684c2b6589f596464e46ea86e9a7f1bb2/backend/src/main/kotlin/org/loculus/backend/service/submission/SubmissionDatabaseService.kt#L164-L170
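For illustration, the clash can be reproduced in miniature with Python's sqlite3 standing in for Postgres (the three-column primary key matches the table used by the linked insert statement; everything else here is a simplified assumption, not the real backend code):

```python
import sqlite3

# Miniature of sequence_entries_preprocessed_data: the composite
# primary key is the constraint named in the duplicate-key error.
conn = sqlite3.connect(":memory:")
conn.execute(
    """create table sequence_entries_preprocessed_data (
           accession text, version int, pipeline_version int,
           primary key (accession, version, pipeline_version)
       )"""
)

claim = "insert into sequence_entries_preprocessed_data values (?, ?, ?)"
conn.execute(claim, ("LOC_0001", 1, 1))      # first pipeline claims the sequence
try:
    conn.execute(claim, ("LOC_0001", 1, 1))  # second pipeline, same sequence
except sqlite3.IntegrityError as e:
    print("second claim rejected:", e)
```

When many pipelines run the extract query at the same time, each of them can select the same not-yet-claimed sequences, and all but the first insert then fails exactly like this.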
I'm still thinking how to best implement this.
/extract-unprocessed-data endpoint? This seems non-trivial, because the data is streamed lazily to the client.
Do I understand correctly that there are two things that are not working properly?
1. We would like to avoid sending the same unprocessed sequence to multiple pipelines at the same time, but at the moment, if several pipelines ask for sequences simultaneously, this may happen. (In practice, for a future big instance, it is certainly realistic that we will have many pipeline instances running in parallel.)
Yes, correct. The second pipeline will then receive an error.
2. If a processing pipeline submits a sequence that has already been processed, we would expect the backend to either reject the pipeline's submission with a proper error message or, as you suggested, simply ignore that sequence. This is not the case at the moment.
I'm quite sure that the backend rejects this with a proper error message already. But in this case, we don't even get there. This is only about sending the sequences from the backend to the pipeline. So I was thinking about ignoring the sequence when sending it to the pipeline - this solution feels a bit hacky, but actually I think it's the desired behavior.
At the same time, regarding Cornelius' concern:
I'm not sure if this is just a test error, or this shows that our backend can end up in a bad state
The backend (or database) is not really in a bad state. The database does not get corrupted - right?
The database does not get corrupted. The preprocessing pipelines get invalid responses (they contain a plain error message at the end, not valid NDJSON). In the worst case, sequences don't get processed because a pipeline crashes, and they stay "in processing" until the scheduler puts them back to "received". But still, every sequence is received by only a single pipeline, because sending it a second time already errors.
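To make that failure mode concrete, here is a toy model of a lazily streamed NDJSON response that gets a plain-text error appended mid-stream (the function and row shapes are invented for illustration; only the overall behavior matches what is described above):

```python
import json

def stream_unprocessed(rows, fail_after):
    # Toy stand-in for the backend's lazy NDJSON stream: once a
    # failure hits mid-stream, a plain-text error line is emitted
    # instead of a JSON object, so the response as a whole is no
    # longer valid NDJSON.
    for i, row in enumerate(rows):
        if i == fail_after:
            yield "duplicate key value violates unique constraint"
            return
        yield json.dumps(row)

for line in stream_unprocessed([{"accession": "A", "version": 1},
                                {"accession": "B", "version": 1}],
                               fail_after=1):
    try:
        json.loads(line)  # an NDJSON client parses line by line
    except json.JSONDecodeError:
        print("pipeline received a non-JSON line:", line)
```

This is why the pipeline sees a parse error rather than a clean HTTP error: the status line and headers were already sent by the time the insert failed.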
Okay, can we solve this by changing the order in which we fetch unprocessed data? We first insert into sequence_entries_preprocessed_data:
insert into sequence_entries_preprocessed_data (accession, version, pipeline_version)
select se.accession, se.version, ?
from sequence_entries se
where
se.organism = '...'
and not exists(
select
from sequence_entries_preprocessed_data sepd
where
se.accession = sepd.accession
and se.version = sepd.version
and sepd.pipeline_version = ?
)
order by se.accession
limit ?
returning accession, version;
and then fetch the original data?
I'm not sure how far the database's transaction management goes and whether it can keep concurrent insert statements from clashing. I would guess that the insert statements will all work, but if there are occasional errors, we might need to retry the insert query until it succeeds. That should be fine, because we wouldn't have started streaming yet.
Thanks for digging into this error! Good to know that the database doesn't end up in a corrupt state. Running multiple pipelines in parallel without crashing all the time would be nice, but we can also tackle that once we get there, so it's not super high priority.
If anyone works on this later, I pushed my debug stuff to https://github.com/loculus-project/loculus/tree/1882-e2e-tests-put-backend-into-state-that-errors-with-error-duplicate-key-value-violates-unique-constraint-sequence_entries_preprocessed_data_pkey
I've been investigating flaky E2E tests, improving the error messages to see more what's going on (see https://github.com/loculus-project/loculus/pull/1881/commits/0f1d4a2a91af1251847e3f0a42acc99361959328)
When I properly logged full backend errors, I noticed the following one (part of the "allow bulk deletion test"):
I'm not sure if this is just a test error, or this shows that our backend can end up in a bad state that we haven't noticed yet in manual testing. Would be good to investigate at some point. As far as I'm aware, the E2E tests don't have any special privileges with regards to the backend, so such a state could also be entered by real users.
To reproduce:
npx playwright test "pages/review/index.spec.ts" --repeat-each=10
It seems enough to just run lots of review tests in parallel to trigger this state. I know that the tests will fail due to race conditions between them (they just look at total counts and are not isolated from each other). However, that doesn't mean that simultaneous uploading should produce an illegal backend state.