seqan / raptor

A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.
https://docs.seqan.de/raptor

Not properly parsing NCBI assembly filenames #355

Closed pirovc closed 1 year ago

pirovc commented 1 year ago

Platform

Description

raptor prepare drops everything after the first dot in a filename. For example, the common file name for NCBI genome assemblies, GCF_029338575.1_ASM2933857v1_genomic.fna.gz, turns into GCF_029338575.minimiser and GCF_029338575.header. This not only makes it impossible to trace the output back to the input file, but may also break or behave unexpectedly when two versions of the same assembly (GCF_029338575.1 and GCF_029338575.2) are used, since both would presumably map to the same output names.

How to repeat the problem

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/338/575/GCF_029338575.1_ASM2933857v1/GCF_029338575.1_ASM2933857v1_genomic.fna.gz
find . -name "*.fna.gz" > files.txt
raptor prepare --input files.txt --output tmp

Expected behaviour

GCF_029338575.1_ASM2933857v1_genomic.minimiser and GCF_029338575.1_ASM2933857v1_genomic.header should be created

Actual behaviour

GCF_029338575.minimiser and GCF_029338575.header are created
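
Illustration

The difference between the actual and expected names can be sketched in a few lines of Python. This is only an illustration of the symptom, not raptor's code: both function names and the suffix lists are hypothetical, and how raptor derives the output name internally is an assumption here. The first function cuts at the first dot, which reproduces the observed names; the second strips only one trailing compression suffix and one sequence-file suffix, which yields the expected names.

```python
from pathlib import Path

# Hypothetical suffix lists for this sketch; not taken from raptor.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".bgzf")
SEQUENCE_SUFFIXES = (".fna", ".fasta", ".fa", ".fastq", ".fq")


def stem_cut_at_first_dot(filename: str) -> str:
    """Mimic the reported symptom: keep only the text before the first dot."""
    return Path(filename).name.split(".", 1)[0]


def stem_strip_known_suffixes(filename: str) -> str:
    """Strip at most one compression suffix, then at most one sequence suffix."""
    name = Path(filename).name
    for suffixes in (COMPRESSION_SUFFIXES, SEQUENCE_SUFFIXES):
        for suffix in suffixes:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                break
    return name


if __name__ == "__main__":
    v1 = "GCF_029338575.1_ASM2933857v1_genomic.fna.gz"
    print(stem_cut_at_first_dot(v1) + ".minimiser")
    print(stem_strip_known_suffixes(v1) + ".minimiser")
```

With the first approach, any GCF_029338575.1 and GCF_029338575.2 files end up with the same stem, so their .minimiser and .header outputs would collide. With the second, the version number stays part of the name.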