sapporo-wes / tataki

Command line tool for detecting life science data types.
Apache License 2.0

Chimera files which start with FASTA and then other formats are incorrectly detected as FASTA #7

Open fmaccha opened 4 months ago

fmaccha commented 4 months ago

Files that begin with FASTA-formatted content followed by content in another format, as shown below, are incorrectly detected as FASTA.

>ref
AGCATGTTAGATAAGATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
>ref2
aggttttataaaacaattaagtctacagagcaactacgcg
@HD     VN:1.0 SO:coordinate

./chimera.txt:
  decompressed:
    id: null
    label: null
  label: FASTA
  id: http://edamontology.org/format_1929

This happens because noodles::fasta reads the SAM header line at the end of the file as a FASTA sequence. Its documentation says:

FASTA is a text format with no formal specification and only has de facto rules. It typically consists of a list of records, each with a definition on the first line and a sequence in the following lines.
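
One possible way to reject such chimera files is to check the characters of each sequence line before accepting the file as FASTA. The following is only a minimal sketch using the Rust standard library; it is not tataki's implementation and does not use the noodles API. The function name `looks_like_strict_fasta` and the allowed character set (ASCII letters plus `*` and `-`) are assumptions made for illustration.

```rust
/// Hypothetical helper, not part of tataki or noodles.
/// Returns true only if the text has at least one definition line and
/// every sequence line contains only characters plausible in a sequence.
fn looks_like_strict_fasta(text: &str) -> bool {
    let mut saw_definition = false;
    let mut saw_sequence = false;

    for line in text.lines() {
        let line = line.trim_end();
        if line.is_empty() {
            continue;
        }
        if line.starts_with('>') {
            // Definition line: free text, so nothing further is checked here.
            saw_definition = true;
        } else {
            // Sequence data before any definition line is rejected.
            if !saw_definition {
                return false;
            }
            // Assumption: a sequence line holds only ASCII letters,
            // '*' (stop codon) or '-' (gap). A SAM header such as
            // "@HD VN:1.0" contains '@', spaces, ':' and digits, so it fails.
            let valid = line
                .bytes()
                .all(|b| b.is_ascii_alphabetic() || b == b'*' || b == b'-');
            if !valid {
                return false;
            }
            saw_sequence = true;
        }
    }

    saw_definition && saw_sequence
}
```

With this check, the chimera example above would be rejected because of the `@HD` line, while the same file without that line would still pass. The trade-off is that FASTA has no formal specification, so a character whitelist like this may reject unusual but legitimate files.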