loculus-project / loculus

An open-source software package to power microbial genomic databases
https://loculus.org
GNU Affero General Public License v3.0

Backend FASTA parser does not treat FASTA headers with spaces correctly (according to spec) #2923

Open corneliusroemer opened 1 month ago

corneliusroemer commented 1 month ago

According to the FASTA spec, the part of the header from its start up to the first whitespace character is the sequence ID; everything after that is the description.
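
The rule can be sketched in a few lines of Python. This is only an illustration of the convention described above, not the Loculus backend code; the function name `split_fasta_header` is hypothetical.

```python
def split_fasta_header(line: str) -> tuple[str, str]:
    """Split a FASTA header line into (sequence_id, description).

    Hypothetical helper: the ID is everything after '>' up to the first
    whitespace character; the rest, if any, is the description.
    """
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header line: {line!r}")
    # split(None, 1) splits once, on the first run of whitespace (spaces or tabs)
    parts = line[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("FASTA header has no sequence id")
    seq_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


print(split_fasta_header(">custom5 1"))  # ('custom5', '1')
print(split_fasta_header(">custom0"))    # ('custom0', '')
```

Under this reading, the header `custom5 1` yields the ID `custom5` and the description `1`.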

The backend currently does not seem to follow the spec. For example, in the files below, note the sequence with the header custom5 1: its ID should be parsed as custom5 and the submission should be accepted. Instead, the backend responds with:

Body = {"type":"about:blank","title":"Unprocessable Entity","status":422,"detail":"Metadata file contains 1 submissionIds that are not present in the sequence file: custom5; Sequence file contains 1 submissionIds that are not present in the metadata file: custom5 1","instance":"/dummyOrganism/submit"}

submissionId	date	region	country	division	host
custom4	2020-12-03	Europe	Switzerland	Zürich	Homo sapiens
custom0	2020-12-26	Europe	Switzerland	Bern	Homo sapiens
custom1	2020-12-15	Europe	Switzerland	Schaffhausen	Homo sapiens
custom2	2020-12-02	Europe	Switzerland	Bern	Homo sapiens
custom6	2020-12-16	Europe	Switzerland	Aargau
custom3		Europe	Switzerland	Schaffhausen	Homo sapiens
custom5	2020-12-23	Europe	Switzerland	Basel-Land	Homo sapiens
custom7	2XXXXX	Europe	Switzerland	Sankt Gallen	Homo sapiens
custom8	2020-12-16	Europe	Switzerland	Aargau	Homo sapiens
custom9	2020-12-01	Europe	Switzerland	Basel-Stadt	Homo sapiens
>custom1
ACTG
>custom2
ACTG
>custom3
ACTG
>custom4
ACTG
ACTG
>custom7
ACT
ACT
>custom8
ACTG
>custom9
AC
>custom6
AC

>custom5 1
ACTG
>custom0
ACTG
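
The mismatch in the error message can be reproduced with a small Python sketch of the ID cross-check. Everything here is hypothetical (`fasta_ids`, `compare_ids`, the `first_token_only` flag); the assumption that the backend uses the whole header line as the ID is inferred only from the error detail above, not from the backend source. The data is a shortened version of the example files.

```python
def fasta_ids(fasta_text: str, first_token_only: bool) -> set[str]:
    """Collect sequence IDs from FASTA text (hypothetical helper)."""
    ids = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence data or blank line
        header = line[1:].strip()
        if first_token_only and header:
            # Spec-style ID: only the part before the first whitespace
            header = header.split(None, 1)[0]
        ids.add(header)
    return ids


def compare_ids(metadata_ids: set[str], sequence_ids: set[str]):
    """Return (IDs only in metadata, IDs only in sequences), both sorted."""
    return sorted(metadata_ids - sequence_ids), sorted(sequence_ids - metadata_ids)


# Shortened example: three of the submissionIds from the files above
metadata_ids = {"custom4", "custom5", "custom0"}
fasta_text = ">custom4\nACTG\nACTG\n\n>custom5 1\nACTG\n>custom0\nACTG\n"

# Whole header used as the ID (assumed current behaviour): mismatch
print(compare_ids(metadata_ids, fasta_ids(fasta_text, first_token_only=False)))
# (['custom5'], ['custom5 1'])

# Only the first token used as the ID: no mismatch
print(compare_ids(metadata_ids, fasta_ids(fasta_text, first_token_only=True)))
# ([], [])
```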

chaoran-chen commented 1 month ago

What is the FASTA spec? I found various definitions but couldn't identify a proper standard.

corneliusroemer commented 1 month ago

There's no single spec, but in many ways there's a rough consensus: https://www.ncbi.nlm.nih.gov/genbank/fastaformat/ or https://pacbiofileformats.readthedocs.io/en/13.0/FASTA.html

Also see: https://www.biostars.org/p/11254/