loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Remove '-' from list of allowed symbols in unaligned sequences. #2728

Closed anna-parker closed 1 week ago

anna-parker commented 1 week ago

resolves #

preview URL: https://no-gaps-in-unaligned.loculus.org/

Summary

'-' only makes sense in the context of aligned sequences. It is not accepted by ENA and is not included in official IUPAC lists: https://genome.ucsc.edu/goldenPath/help/iupac.html#:~:text=The%20International%20Union%20of%20Pure,for%20either%20G%20or%20A).

fengelniederhammer commented 1 week ago

I think the backend has the same problem: https://github.com/loculus-project/loculus/blob/4df831f8d4295b17320773260c21d2ee9789e63a/backend/src/main/kotlin/org/loculus/backend/service/submission/ProcessedSequenceEntryValidator.kt#L58-L75

anna-parker commented 1 week ago

@fengelniederhammer the backend uses the nucleotide symbol list in the type check validateNoUnknownNucleotideSymbol, which is used for both unaligned and aligned sequences. Since '-' is a valid symbol for aligned sequences, I think it is OK to keep the backend as is.

fengelniederhammer commented 1 week ago

Isn't it the same issue? In aligned sequences '-' should be allowed; in unaligned sequences it should not be. We could easily introduce a second symbol list in the backend as well, so that aligned and unaligned sequences are validated separately.
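
A minimal sketch of what splitting the validation could look like, written in Python for brevity rather than in the backend's Kotlin. All names here (IUPAC_NUCLEOTIDE_SYMBOLS, UNALIGNED_SYMBOLS, ALIGNED_SYMBOLS, find_invalid_symbols) are hypothetical and not taken from the Loculus codebase, and the exact symbol set is an assumption; the real backend list may differ.

```python
# Hypothetical sketch: separate allowed-symbol sets for aligned and unaligned
# nucleotide sequences. Names and the exact symbol set are assumptions and are
# not taken from the Loculus backend.

# Assumed base set: nucleotide letters plus IUPAC ambiguity codes.
IUPAC_NUCLEOTIDE_SYMBOLS = frozenset("ACGTURYSWKMBDHVN")

# Unaligned sequences: letters only, no gap character.
UNALIGNED_SYMBOLS = IUPAC_NUCLEOTIDE_SYMBOLS

# Aligned sequences: additionally allow '-' as the alignment gap.
ALIGNED_SYMBOLS = IUPAC_NUCLEOTIDE_SYMBOLS | {"-"}


def find_invalid_symbols(sequence: str, aligned: bool) -> list[tuple[int, str]]:
    """Return (position, symbol) pairs for every symbol that is not allowed.

    The sequence is upper-cased first, so lower-case input is accepted and
    reported symbols are always upper-case.
    """
    allowed = ALIGNED_SYMBOLS if aligned else UNALIGNED_SYMBOLS
    return [
        (position, symbol)
        for position, symbol in enumerate(sequence.upper())
        if symbol not in allowed
    ]
```

With this split, find_invalid_symbols("ACGT-N", aligned=False) reports the gap at position 4, while the same call with aligned=True returns an empty list.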