pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

[Q] pruning of an existing .gfa #79

Closed Louis-MG closed 9 months ago

Louis-MG commented 9 months ago

Hello! I built unitigs using bcalm2. Now I am interested in pruning the DBG to see if I can obtain longer unitigs by removing short tips. I thought I could use the update command of Bifrost, but it requires a sequence input file and a reference input file, and I can't really map these concepts onto my data (which was just a FASTA file of k-mers, used to obtain the unitigs). I then saw that I could maybe build a DBG from the unitigs file and output a FASTA file with:

Bifrost build --input-seq-file ./input/case_kmers.unitigs.fa -o ./output/case_kmers.unitigs.pruned -t 10 --fasta-out --clip-tips

But the resulting FASTA file was empty, with no error (also, it is unclear whether the FASTA file would contain k-mers or unitigs). Given the nature of unitigs, I know it is impossible that all of them were removed by the --clip-tips option, as some were longer than 31. I do not understand why the output is empty, and I wondered whether an existing GFA could theoretically be pruned without a FASTA file as a mandatory input. If not, why?

I guess I should build the graph from the original kmers.fasta, prune the short tips, output a kmers.pruned.fasta and use it with bcalm2, correct? Thank you for your answers to my numerous questions!

GuillaumeHolley commented 9 months ago

Hi @Louis-MG,

Indeed, the update command is akin to an add feature and can only be used to add more sequences to a graph. In your case, you want to trim so this is not what you need.

Your intuition with build is correct, and this is the command that you need. However, --input-seq-file tells Bifrost that your input is a set of reads, so Bifrost keeps only the k-mers occurring twice or more in your input. This is why you get an empty output (without errors): your input graph contains only k-mers occurring once. What you need instead is --input-ref-file, which tells Bifrost to keep all input k-mers.
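As a sketch, the command from the question with only the input flag swapped might look like the following. The paths, thread count, and other options are copied from the original command. The `input`, `output`, and `bifrost_args` variable names and the guard around the call are illustrative additions, so that the snippet does nothing harmful when Bifrost or the input file is missing.

```shell
input=./input/case_kmers.unitigs.fa          # unitigs file from the question
output=./output/case_kmers.unitigs.pruned    # output prefix from the question

# Only the input flag differs from the original command:
# --input-ref-file keeps every input k-mer, including those seen once.
bifrost_args="build --input-ref-file $input -o $output -t 10 --fasta-out --clip-tips"
echo "Bifrost $bifrost_args"

# Run only if Bifrost is on the PATH and the input file exists.
if command -v Bifrost >/dev/null 2>&1 && [ -f "$input" ]; then
    # Word splitting on $bifrost_args is intended; the paths contain no spaces.
    Bifrost $bifrost_args
fi
```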

Keep in mind that with --fasta-out, Bifrost will output the resulting graph in FASTA format but, unlike BCALM2, without the edge information in the FASTA record headers. If you need the edge information, you will have to output the graph in GFA (and possibly parse it back to the FASTA-like format that BCALM2 outputs).
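The GFA-to-FASTA conversion mentioned above could be sketched roughly as follows. This is not part of Bifrost or BCALM2: `gfa_to_fasta` is a hypothetical helper. It assumes a plain-text, tab-separated GFA with `S` (segment) and `L` (link) lines, and writes BCALM2-style headers containing only the `LN:i:` length and `L:<+/->:<id>:<+/->` link fields. The abundance fields that BCALM2 also writes are omitted, segment names are kept as they appear in the GFA, and each link is also emitted in its reverse-complement direction on the assumption that every unitig should list the links leaving either of its ends.

```shell
# Hypothetical helper: convert a GFA file into a BCALM2-like FASTA on stdout.
gfa_to_fasta() {
    awk -F'\t' '
        # Flip a strand orientation.
        function flip(o) { return (o == "+") ? "-" : "+" }

        # Append one link field to the header of "from", skipping duplicates.
        function add(from, fo, to, to_o,    key) {
            key = from SUBSEP fo SUBSEP to SUBSEP to_o
            if (!(key in seen)) {
                seen[key] = 1
                links[from] = links[from] " L:" fo ":" to ":" to_o
            }
        }

        # Segment line: remember the name (in input order) and the sequence.
        $1 == "S" { order[++n] = $2; seq[$2] = $3 }

        # Link line: record the link and its reverse-complement counterpart.
        $1 == "L" {
            add($2, $3, $4, $5)
            add($4, flip($5), $2, flip($3))
        }

        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                printf ">%s LN:i:%d%s\n%s\n", id, length(seq[id]), links[id], seq[id]
            }
        }
    ' "$1"
}

# Example call (the file names are assumptions, adjust to your output):
# gfa_to_fasta ./output/case_kmers.unitigs.pruned.gfa > pruned.unitigs.fa
```

If the GFA is compressed, it would need to be decompressed first, since the function reads plain text.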

Let me know if this helps, Guillaume