pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License
201 stars 25 forks source link

Unitigs with no color #74

Closed aryakaul closed 11 months ago

aryakaul commented 11 months ago

Hello,

Apologies if I'm misunderstanding something but I am attempting to extract the unitigs by color matrix from Bifrost's output. My current approach is adapted from #50.

zcat bifrost_out.gfa.gz | awk '{if ($1=="S") {print ">" $2 "\n" $3}}' > unitigs.fasta
Bifrost query -v -t 32 -e 1.0 -g bifrost_out.gfa.gz -C bifrost_out.color.bfg -q unitigs.fasta -o unitigs.colors

But when I query the associated unitigs.colors.tsv I occasionally get unitigs with no colors (0 0):

 ➜ cat unitigs.colors.tsv | cut -f2- | sed 1d | sort | uniq -c
    466 0       0
   1253 0       1
    675 1       0
   1394 1       1

From my understanding, every unitig should have at least one color associated with it, otherwise it should not appear in the GFA? Any help or advice would be appreciated!

Best, Arya

GuillaumeHolley commented 11 months ago

Hi @aryakaul,

By using -e 1.0 in your search, you only report a queried unitig as "present" in one color if all of its composing k-mers have that color. So what your query file shows is that you have 466 unitigs for which not all the k-mers are colored with color 1 or color 2. An example of why this happens is if a k-mer x from sample 1 overlaps (by k-1 base pairs) a k-mer y from sample 2. The resulting colored graph will contain a unitig with x and y for which color(x)=1 and color(y)=2. Hence, not all k-mers of that unitig would have color 1 or 2.

If you redo the same experiment by querying k-mers rather than unitigs as shown in #50, all your queries should show at least one color.

If this is of interest for your use case, the Bifrost version on the robin branch (which is the next Bifrost release) adds extra search options where in addition to reporting a queried sequence as "present" or "absent" using a fixed threshold -e, you can instead reports either the number of k-mers found or the ratio of found k-mer per color. However, you would have to redo a graph as this version breaks file format compatibility with current Bifrost release,

Let me know if this helps, Guillaume

aryakaul commented 11 months ago

Thank you! This is very helpful! I've been working with small toy datasets up until now, and didn't run into this so thought I had done something wrong. Noted about the robin branch, I'll take a crack at building that.

Best, Arya