loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0
37 stars 2 forks source link

Ingest: Potentially refactor ingest to hash metadata fields earlier? #2799

Closed anna-parker closed 1 month ago

anna-parker commented 1 month ago

As seen already in https://github.com/loculus-project/loculus/pull/2653 and https://github.com/loculus-project/loculus/issues/2788 we have incorrectly mapped or formatted certain insdc metadata fields to Loculus metadata fields, we also drop certain insdc metadata fields. There are no doubt more such cases.

The problem

Each time the resulting loculus metadata fields change the hash ingest calculates changes. Meaning that the version of that sequence will increase. However, we do not want to increase the version of a sample if the only changes are due to our mapping.

Suggestion

Hash ingested metadata earlier - preferably prior to any reformatting or mapping to Loculus metadata fields. This will initially cause a change of hashes and a version update (which we should probably revert), but afterwards we will not have to worry about fixing ingest metadata mappings, or using more insdc metadata fields.

Potential side effects

If an insdc metadata field that we do not use in Loculus is updated this will still cause a version increase.

corneliusroemer commented 1 month ago

I'm not sure I understand. We hash whatever we submit to Loculus. If something submitted to loculus changes we need to revise. That's what the current pipeline does and is what it should do. We should just give up on the idea of staying on version 1. If something submitted to Loculus changes it gets a version bump.

Maybe I'm missing some here, but as it is explained I don't get it.

Each time the resulting loculus metadata fields change the hash ingest calculates changes. Meaning that the version of that sequence will increase. However, we do not want to increase the version of a sample if the only changes are due to our mapping.

There is no way round this with our setup. Ingest is a submitter like any other. To change anything, submitters need to bump the version.

We can one off work around it as for authors with some effort and manual database override operations but we really don't want to do this very often if at all.

anna-parker commented 1 month ago

So what I am saying here is instead of calculating the hash in order to check if what we have submitted to Loculus from INSDC has changed (calculated in prepare_metadata.py on the results of our mapping from INSDC to Loculus fields) we should use the hash to check if the data we received from INSDC has changed - this is more in line with what we have agreed on should be versioned.

As you say this will mean that the hash value means sth different. Now we shall say that "Reingest will bump the version when there is a change in data in INSDC - not when there is a change in the data we submit to Loculus from INSDC"

anna-parker commented 1 month ago

From the discussion today it seems that the potential side effects of this restructuring are so not desirable that we should just continue with the current set up and actually allow version updates for ingest mapping changes.

corneliusroemer commented 1 month ago

We should discuss this in any case - you might have a point here I'm just not sure I understand it yet. Also close as "not planned" in future for things that are closed not because they were accompanied by real code changes but closed because we're not doing it :)

image
anna-parker commented 1 month ago

We discussed offline that even if we calculate the ingest hash on the data we get from INSDC we need to know if there has been a change relative to what we uploaded to loculus as we then need to revise sequences - so this doesn't make sense atm.