iqbal-lab-org / cobs

COBS - Compact Bit-Sliced Signature Index (for Genomic k-Mer Data or q-Grams)
https://panthema.net/cobs
MIT License

Result output truncates reference files at first dot #19

Open smehringer opened 1 year ago

smehringer commented 1 year ago

Hello,

We built an index over RefSeq genomes. The downloaded files are named like this:

/path/GCF_000019125.1_ASM1912v1_genomic.fna.gz
/path/GCF_000019165.1_ASM1916v1_genomic.fna.gz
...

When searching the index, the result looks as follows:

*query1 XXX
GCF_000019125 XXX
GCF_000019165 XXX
...

Luckily for us, the names are still unique, so with some effort we should be able to map the output back to the full reference names.

This format is lossy if the names are not unique before the first dot, and it might even lead to severe false negatives if the user does not notice.
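For illustration, here is a minimal sketch of the workaround and of the collision check. It is not COBS code: it only assumes, based on the output above, that the reported name is the file name cut at its first dot. The helper names (`truncated_name`, `build_lookup`, `ambiguous_names`) are made up for this example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def truncated_name(path: str) -> str:
    """Return the file name up to (not including) its first dot.

    Assumption: this mirrors the names seen in the search output.
    """
    return PurePosixPath(path).name.split(".", 1)[0]


def build_lookup(paths):
    """Map each truncated name to all full paths that share it."""
    lookup = defaultdict(list)
    for path in paths:
        lookup[truncated_name(path)].append(path)
    return dict(lookup)


def ambiguous_names(lookup):
    """Return only the truncated names that map to more than one file."""
    return {name: full for name, full in lookup.items() if len(full) > 1}


# The two file names from the report above.
paths = [
    "/path/GCF_000019125.1_ASM1912v1_genomic.fna.gz",
    "/path/GCF_000019165.1_ASM1916v1_genomic.fna.gz",
]
lookup = build_lookup(paths)

# Hypothetical second version of the same accession, added only to
# show how two references would collapse into one reported name.
colliding = paths + ["/path/GCF_000019125.2_ASM1912v2_genomic.fna.gz"]
collisions = ambiguous_names(build_lookup(colliding))
```

Running `ambiguous_names` over the input file list before interpreting the results would at least flag the cases where the output can no longer be mapped back to a single reference.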

Best, Svenja

iqbal-lab commented 7 months ago

Thanks for pointing this out, @smehringer. I don't understand why I didn't get notified of your comment. I will follow this up, but Leandro has left the project, so there will be a delay.