pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

How is unitig sequence strandedness decided #42

Closed by samhorsfield96 2 years ago

samhorsfield96 commented 3 years ago

Hi,

I have a quick question about how unitig sequences are stored in the graph. How is the strandedness of a stored unitig sequence decided? Is it always the same as the strandedness of the input sequence, or can it vary depending on how the constituent k-mers are sorted, e.g. lexicographically?

GuillaumeHolley commented 2 years ago

Hi @samhorsfield96,

I am going over the unanswered issues of the Bifrost repository and I see that I never replied to you about this one (shame on me). If it is not too late, the quick answer is that there is no specific criterion for deciding the strandedness of a unitig sequence. And even if there was one, you shouldn't rely on this property. The long answer is that it depends on the strandedness of the first k-mer that is read from the input files and starts the extension of an approximate unitig when querying the Bloom filter. When using a single thread, input files are read from top to bottom, so this should be deterministic. However, when using multiple threads, each thread deals with its own chunk of reads from the input files, and the chunks can be processed in any order, so the strandedness is not deterministic in multi-threaded mode (the output graph is always deterministic though).

Guillaume
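
Since the stored strand of a unitig should not be relied on, one common way to compare unitig sequences across runs is to put each sequence into a strand-independent (canonical) form first. The snippet below is only a minimal sketch of that idea in plain Python. It does not use the Bifrost API, the helper names `reverse_complement` and `canonical` are hypothetical, and it assumes sequences contain only A, C, G and T.

```python
# Hypothetical helpers for illustration; these are NOT part of the Bifrost API.
# Assumption: sequences contain only A/C/G/T (upper or lower case).

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Return the lexicographically smaller of seq and its reverse complement.

    Both strands of the same unitig map to the same string, so the result
    does not depend on which strand happened to be stored.
    """
    rc = reverse_complement(seq)
    return seq if seq <= rc else rc


# The same unitig as it might be stored by two different runs (made-up data).
unitig_run_a = "GATTACA"
unitig_run_b = "TGTAATC"  # reverse complement of unitig_run_a

same_unitig = canonical(unitig_run_a) == canonical(unitig_run_b)
print(same_unitig)
```

Comparing `canonical(...)` values instead of raw sequences makes the comparison insensitive to the strand chosen at construction time.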