loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0
37 stars 2 forks source link

Ingest should not treat curator revisions as latest and revise #3084

Open corneliusroemer opened 1 week ago

corneliusroemer commented 1 week ago

Given curators might change insdc_ingest_user submitted sequences (something that wasn't considered when writing ingest pipeline) we need to avoid the following:

This may not actually happen with current code, it's just a possibility as it wasn't considered during development. Easy solution would be to filter out non-ingest-user-submitted sequences.

anna-parker commented 1 week ago

This is where the comparison is done: https://github.com/loculus-project/loculus/blob/main/ingest/scripts/compare_hashes.py#L153 - so we only compare the hash of the current metadata on NCBI to the hash of the latest version on ingest. So in this case the curator revising and the version being bumped should not cause ingest to update/push out another version unless the curator additionally curated/changed the hash. However, https://github.com/loculus-project/loculus/issues/3085 is a real risk

corneliusroemer commented 1 week ago

So in this case the curator revising and the version being bumped should not cause ingest to update/push out another version u

Unfortunately it does happen: https://main.loculus.org/seq/LOC_00037FR.3

I think your potentially wrong assumption is that submitted contains only the insdc_user submitted sequences, have you checked that? I wanted to do that but decided to do a quick experiment first and it turned out to show we are vulnerable to this right now.

corneliusroemer commented 1 week ago

We could do a username filter here: https://github.com/loculus-project/loculus/blob/6eb836612d04070930e7e8bb0df10db401bd8c58/ingest/scripts/call_loculus.py#L311-L312 or simply get the username and filter on it ourselves (username filter might not be supported by backend)

anna-parker commented 1 week ago

No... my assumption is that submitted should contain a map from insdc accessions to the corresponding loculus version and accession, which is what the general code description says: get_submitted(config: Config) - so probably there is a bug there

corneliusroemer commented 1 week ago

Revisions from superuser just also seem to count in here, doesn't really matter whether it's a bug or not, we assumed that no one else would submit other than insdc_ingest_user and now this requirement is changing.

anna-parker commented 1 week ago

I probably am missing sth. I don't understand how revisions from superuser being in the submitted list are an issue - in the hash comparison step ingest should only revise if the hash has changed. So if there is a new revision unless the hash changes (compared to the hash calculated on the data in INSDC) I didnt think ingest would revise. But it seems I am missing sth

corneliusroemer commented 1 week ago

I think (not sure) that superuser submissions show up in "/get-released-data", and they have "hash: null" or None or something which != somethingnonnull

anna-parker commented 1 week ago

Commenting with our findings from yesterday: This is exactly it - our current revision docs ask users to download the metadata via get-released-data and then upload with changes - the hash metadata field is not downloaded and thus the revisions will all have hash=null - leading ingest to think there has been a change when it compares hash=null with the hash it just calculated on the data it downloaded from NIH - it will them submit a revision - essentially overwriting the changes that the curator made. Great catch @corneliusroemer!