ksahlin / strobemers

A repository for generating strobemers and evaluation
Multi-fasta handling #5

Closed moorembioinfo closed 2 years ago

moorembioinfo commented 2 years ago

Hi,

Sorry if I've missed this detail in the documentation, but I was wondering how strobemers handles multi-fasta* files of assembled genomes? I don't want to concatenate the contigs, as this would create artificial overlaps across contig boundaries.

Thanks in advance for any help with this.

Best,

Matt

* Contig1 ATAGCAGATAGC... Contig2 ATACGACATAGC... etc.

ksahlin commented 2 years ago

Hi @moorembioinfo,

Are you using the tool StrobeMap or a function in the strobemers library?

If you use the tool StrobeMap, it handles multi-fasta files. StrobeMap takes a set of references and a set of queries and maps all queries to the references. On that note, I strongly recommend using the C++ implementation of StrobeMap rather than the Python implementation, because it is faster and uses less memory.

If you mean using the strobemer functions in the index.cpp or indexing.py modules, then these functions take only a single sequence as their argument, so you would have to implement or use a fasta parser function to iterate over the records in a multi-fasta file and call the function once per record.
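As a rough illustration of that per-record approach, here is a minimal sketch in Python. The names `read_fasta` and `seeds_per_record` are hypothetical helpers, not part of the strobemers library, and `seed_fn` stands in for whichever single-sequence function from indexing.py you want to call. Its exact name and arguments are not assumed here, so wrap it in a function that takes one sequence string. A plain k-mer function is used below only as a placeholder.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) for each record in a (multi-)fasta file handle.

    Hypothetical helper: the name is the first word after '>' and
    multi-line sequences are joined into one string.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def seeds_per_record(handle, seed_fn):
    """Apply seed_fn to each record separately, so no seed spans two contigs."""
    return {name: seed_fn(seq) for name, seq in read_fasta(handle)}


def placeholder_kmers(seq, k=3):
    """Stand-in for a single-sequence strobemer function (assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    example = io.StringIO(">Contig1\nATAG\nCAGA\n>Contig2 second contig\nATAC\n")
    for contig, seeds in seeds_per_record(example, placeholder_kmers).items():
        print(contig, seeds)
```

Because every record is processed on its own, contigs of different lengths are handled independently and nothing is generated across the boundary between two contigs. For real data, an established parser such as Biopython's `SeqIO.parse` could replace `read_fasta`.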

Best, Kristoffer

moorembioinfo commented 2 years ago

Hi Kristoffer,

Thanks for your rapid response; that answers my question. I can see now that the Python implementation (indexing.py) handles contigs of different sizes passed to it.

Congratulations on a great tool!

Best,

Matt