Open corneliusroemer opened 1 week ago
This is where the comparison is done: https://github.com/loculus-project/loculus/blob/main/ingest/scripts/compare_hashes.py#L153 - so we only compare the hash of the current metadata on NCBI to the hash of the latest version on ingest. So in this case the curator revising and the version being bumped should not cause ingest to update/push out another version unless the curator additionally curated/changed the hash. However, https://github.com/loculus-project/loculus/issues/3085 is a real risk
So in this case the curator revising and the version being bumped should not cause ingest to update/push out another version u
Unfortunately it does happen: https://main.loculus.org/seq/LOC_00037FR.3
I think your potentially wrong assumption is that submitted
contains only the insdc_user submitted sequences, have you checked that? I wanted to do that but decided to do a quick experiment first and it turned out to show we are vulnerable to this right now.
We could do a username filter here: https://github.com/loculus-project/loculus/blob/6eb836612d04070930e7e8bb0df10db401bd8c58/ingest/scripts/call_loculus.py#L311-L312 or simply get the username and filter on it ourselves (username filter might not be supported by backend)
No... my assumption is that submitted should contain a map from insdc accessions to the corresponding loculus version and accession, which is what the general code description says: get_submitted(config: Config) - so probably there is a bug there
Revisions from superuser just also seem to count in here, doesn't really matter whether it's a bug or not, we assumed that no one else would submit other than insdc_ingest_user and now this requirement is changing.
I probably am missing sth. I don't understand how revisions from superuser being in the submitted
list are an issue - in the hash comparison step ingest should only revise if the hash has changed. So if there is a new revision unless the hash changes (compared to the hash calculated on the data in INSDC) I didnt think ingest would revise. But it seems I am missing sth
I think (not sure) that superuser submissions show up in "/get-released-data", and they have "hash: null" or None
or something which != somethingnonnull
Commenting with our findings from yesterday: This is exactly it - our current revision docs ask users to download the metadata via get-released-data
and then upload with changes - the hash
metadata field is not downloaded and thus the revisions will all have hash=null
- leading ingest to think there has been a change when it compares hash=null
with the hash it just calculated on the data it downloaded from NIH - it will them submit a revision - essentially overwriting the changes that the curator made. Great catch @corneliusroemer!
Given curators might change insdc_ingest_user submitted sequences (something that wasn't considered when writing ingest pipeline) we need to avoid the following:
This may not actually happen with current code, it's just a possibility as it wasn't considered during development. Easy solution would be to filter out non-ingest-user-submitted sequences.