Generate standardized sequence names

loculus-project / loculus

An open-source software package to power microbial genomic databases

https://loculus.org

GNU Affero General Public License v3.0

Generate standardized sequence names #1487

Closed emmahodcroft closed 3 months ago

emmahodcroft commented 6 months ago

Apologies if this issue already exists. I couldn't find it by searching, but there are many ways it might be listed!

The idea is to generate a standardized sequence name during preprocessing, which may be much easier for people to use as tip labels, to refer to in text, etc. For example: SARS-CoV-2/host/country/sampleID/date

It would be nice to have this in place for Alpha, because people could give us feedback on it and we might catch cases where the names come out weird-looking as people try out test data. It could also be introduced during Alpha, though.

chaoran-chen commented 6 months ago

Sample ID would be our accession, right? Should it contain the version?

theosanderson commented 6 months ago

I think the idea is that the sample ID is more the lab's own ID for the sequence.

chaoran-chen commented 6 months ago

If it's the ID that the submitter chooses, we'd need a lot more specification, because that ID is entirely unregulated at the moment. Which characters do we want to allow in the sequence name: only alphanumerics? What about non-Latin (e.g., Chinese) characters? What's the maximum length? What about collisions?

theosanderson commented 6 months ago

IMO:

only alphanumerics
limit the length to e.g. 15 characters

Collisions:

IMO we could tackle these post-MVP, since they need us to make a query (and even that is a bit imperfect)
I would just add -2, -3, etc. to the end

At some stage we could potentially ask each group to provide a prefix of, e.g., up to 5 characters that would be part of this ID.

IMO this is something we can iterate on a bit, and we shouldn't hold off on making a start because we think it's too hard to do perfectly (not saying that you are saying that!)
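
To make the rules above concrete, here is a minimal Python sketch, not Loculus code. The function names `sanitize_sample_id` and `dedupe` are hypothetical, the in-memory `taken` set stands in for the database query mentioned above, and dropping disallowed characters (rather than replacing them) is just one possible choice.

```python
from __future__ import annotations

import re

MAX_LEN = 15  # length limit floated in the thread ("e.g. 15"); an assumption


def sanitize_sample_id(raw: str, max_len: int = MAX_LEN) -> str:
    """Drop everything except ASCII letters and digits, then truncate."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw)
    return cleaned[:max_len]


def dedupe(name: str, taken: set[str]) -> str:
    """Append -2, -3, ... until the name is unused, then record it in `taken`."""
    candidate = name
    n = 2
    while candidate in taken:
        candidate = f"{name}-{n}"
        n += 1
    taken.add(candidate)
    return candidate


taken: set[str] = set()
first = dedupe(sanitize_sample_id("H20 K23/α"), taken)   # "H20K23"
second = dedupe(sanitize_sample_id("H20-K23"), taken)    # "H20K23-2"
```

Note that an ID made up entirely of non-alphanumeric or non-Latin characters sanitizes to an empty string, so a real implementation would need a fallback for that case.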

chaoran-chen commented 6 months ago

How would you deal with non-alphanumerics? Replace with a symbol like _?

I agree that having some nice sequence names would be useful (as people and existing tools are used to such names), but for the alpha/beta phase I feel it would be good enough to include the accession instead of the custom sequence ID, which is more complicated and not straightforward to implement.

corneliusroemer commented 6 months ago

Lab is quite important to have in there as well, but to keep things simple as a start, we could go with: country/loculusaccver/date

Post MVP

I'd say pathogen comes from context, host usefulness depends.

For location, using sub-country level is often very useful as well.

I could imagine something like: countryiso-subdiviso/labid/submitterid/colldate

Information density matters a lot, hence the ISO codes. This also solves problems like Cote D'Ivoire and spaces in strain names.

Example: DEU-BW/RKI/H20K23/2022-04-24

I consume strain names like these on a daily basis when scouring SC2 data, and this format follows good practice. Existing names don't separate lab ID and sample ID, and they allow spaces, so this is an iteration on them.

We should ask for the short lab identifier at group creation; ideally it'd be unique among groups.
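
As a rough illustration of the countryiso-subdiviso/labid/submitterid/colldate format, here is a short Python sketch. The function name `build_name` and its parameters are hypothetical, and it assumes the caller already has ISO codes and a short lab identifier; no lookup or validation of the codes is attempted.

```python
from __future__ import annotations

from datetime import date


def build_name(
    country_iso: str,
    subdivision_iso: str | None,
    lab_id: str,
    submitter_id: str,
    collection_date: date,
) -> str:
    """Join the components with '/', using ISO codes for the location part."""
    location = country_iso.upper()
    if subdivision_iso:
        # Sub-country level is optional; append it after a hyphen when present.
        location = f"{location}-{subdivision_iso.upper()}"
    return "/".join([location, lab_id, submitter_id, collection_date.isoformat()])


name = build_name("DEU", "BW", "RKI", "H20K23", date(2022, 4, 24))
# name == "DEU-BW/RKI/H20K23/2022-04-24"
```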

chaoran-chen commented 3 months ago

As discussed in our current call, for the MVP:

The format suggested by Cornelius sounds good: country/loculusaccver/date
The field should be called display_name
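
A minimal Python sketch of what computing this field could look like. Only the `display_name` field name and the country/loculusaccver/date order come from the thread; the function name, the input keys (`country`, `accession_version`, `collection_date`) and the example accession are assumptions for illustration, not the actual Loculus schema.

```python
def add_display_name(metadata: dict) -> dict:
    """Return a copy of the metadata with a `display_name` field added."""
    # Hypothetical key names; the real preprocessing fields may differ.
    parts = [
        metadata["country"],
        metadata["accession_version"],
        metadata["collection_date"],
    ]
    return {**metadata, "display_name": "/".join(parts)}


record = add_display_name(
    {
        "country": "Germany",
        "accession_version": "ACC_0001.1",  # made-up accession for the example
        "collection_date": "2022-04-24",
    }
)
# record["display_name"] == "Germany/ACC_0001.1/2022-04-24"
```

How to handle missing values or country names containing spaces is not specified in the thread, so the sketch leaves both untouched.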