pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

[feature request] dumping color file into parsable format #50

Closed by damiankao 11 months ago

damiankao commented 3 years ago

It would be great if we could dump the color file into a text format, or something else that is easier to parse, so that we can analyze the graph.

rhysnewell commented 1 year ago

I would also appreciate this, along with a dump of the index file if possible, or perhaps some instructions on how to interpret the byte strings in the colours file.

cgroza commented 1 year ago

This would be useful for creating phylogenies based on k-mer sharing between colours using metrics like Jaccard distance.

cgroza commented 1 year ago

The right solution to this is to use the Bifrost API. I have made an attempt here; maybe it will be useful to others: https://github.com/cgroza/bifrost_jaccard

GuillaumeHolley commented 11 months ago

Hi everyone,

@cgroza Thank you for your implementation!

There is now another fairly simple solution to do this:

  1. Make a FASTA file containing every k-mer of every segment (S line) in the GFA file. Assuming k=31:

    zcat mygraph.gfa.gz | awk 'BEGIN {K=31} {if ($1=="S"){LEN_KM_UNITIG=length($3)-K+1; for (i=1; i<=LEN_KM_UNITIG; i+=1){print ">" $2 "_" i "\n" substr($3,i,K)}}}' > mygraph.kmers.fasta

    Every record in the generated FASTA file has a name with the form >x_y where x is the unitig ID (in the GFA) the k-mer is from and y is the position (1-based) of the k-mer within that unitig.

  2. Query the colored graph (GFA plus color file) with the FASTA file generated in step 1:

    Bifrost query -v -t 16 -e 1.0 -g mygraph.gfa.gz -C mygraph.color.bfg -q mygraph.kmers.fasta -o mygraph.colors

    The output file will be mygraph.colors.tsv, a matrix with one row per k-mer and one column per color. Each cell holds a binary value: 1 if the k-mer is present in the sample matching that color, 0 if it is not.
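As a follow-up to the Jaccard use case raised earlier in the thread, here is a minimal Python sketch (not part of Bifrost) that streams the presence/absence matrix and computes pairwise Jaccard distances between colors. It assumes the TSV has a header row whose first cell labels the k-mer name column and whose remaining cells are the color names, and that every cell is exactly 0 or 1. The function name `pairwise_jaccard` and the tiny example matrix are made up for illustration.

```python
import csv
import io
from itertools import combinations


def pairwise_jaccard(handle):
    """Stream a k-mer x color TSV and return {(color_a, color_b): distance}.

    Assumed layout (an assumption, check it against your own output file):
    header row = <name column label>, color_1, color_2, ...
    data rows  = <k-mer name>, 0/1, 0/1, ...
    """
    reader = csv.reader(handle, delimiter="\t")
    colors = next(reader)[1:]
    pairs = list(combinations(range(len(colors)), 2))
    shared = dict.fromkeys(pairs, 0)  # k-mers present in both colors
    either = dict.fromkeys(pairs, 0)  # k-mers present in at least one color
    for row in reader:
        if not row:
            continue  # skip blank lines
        flags = [cell == "1" for cell in row[1:]]
        for i, j in pairs:
            if flags[i] and flags[j]:
                shared[(i, j)] += 1
            if flags[i] or flags[j]:
                either[(i, j)] += 1
    distances = {}
    for i, j in pairs:
        union = either[(i, j)]
        # Two colors with no k-mers at all are treated as identical here.
        similarity = shared[(i, j)] / union if union else 1.0
        distances[(colors[i], colors[j])] = 1.0 - similarity
    return distances


# Hypothetical 4 k-mer x 3 color matrix, only to show the call.
example = (
    "query_name\tA.fa\tB.fa\tC.fa\n"
    "1_1\t1\t1\t0\n"
    "1_2\t1\t0\t0\n"
    "2_1\t1\t1\t1\n"
    "2_2\t0\t1\t1\n"
)

for (a, b), dist in pairwise_jaccard(io.StringIO(example)).items():
    print(f"{a}\t{b}\t{dist:.3f}")
# A.fa  B.fa  0.500
# A.fa  C.fa  0.750
# B.fa  C.fa  0.333
```

For a real run, pass `open("mygraph.colors.tsv", newline="")` instead of the `io.StringIO` example. The loop only keeps two counters per color pair, so memory does not grow with the number of k-mers, but the work per row is quadratic in the number of colors.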

Given that Bifrost graphs are now fully indexed (a .bfi file is written alongside the .gfa) and are very fast to load into memory, this solution should run quickly.
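If one row per unitig is easier to work with than one row per k-mer, the k-mer names from step 1 can be folded back into unitig IDs. The Python sketch below is an illustration only: it assumes the same TSV layout as described above (header row with color names after the first cell, 0/1 cells) and that row names keep the `x_y` form from the FASTA file. The function name `kmer_counts_per_unitig` and the example matrix are hypothetical.

```python
import csv
import io
from collections import defaultdict


def kmer_counts_per_unitig(handle):
    """Return (colors, {unitig_id: [number of k-mers present, per color]})."""
    reader = csv.reader(handle, delimiter="\t")
    colors = next(reader)[1:]
    counts = defaultdict(lambda: [0] * len(colors))
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Names look like "x_y": split on the LAST underscore so that a
        # unitig ID containing underscores would still be kept whole.
        unitig_id, _, _position = row[0].rpartition("_")
        per_color = counts[unitig_id]
        for idx, cell in enumerate(row[1:]):
            if cell == "1":
                per_color[idx] += 1
    return colors, dict(counts)


# Hypothetical matrix: unitig 1 and unitig 2 each contribute two k-mers.
example_matrix = (
    "query_name\tA.fa\tB.fa\tC.fa\n"
    "1_1\t1\t1\t0\n"
    "1_2\t1\t0\t0\n"
    "2_1\t1\t1\t1\n"
    "2_2\t0\t1\t1\n"
)

colors, per_unitig = kmer_counts_per_unitig(io.StringIO(example_matrix))
print("unitig", *colors, sep="\t")
for unitig_id, per_color in per_unitig.items():
    print(unitig_id, *per_color, sep="\t")
```

In the example, unitig 1 ends up with counts 2, 1, 0 and unitig 2 with 1, 2, 2 for A.fa, B.fa and C.fa respectively. A count equal to the unitig's total number of k-mers means the color covers the whole unitig.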