pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License
201 stars 25 forks source link

Number of FASTA/FASTQ files in input can indeed be 0 #65

Closed Krannich479 closed 1 year ago

Krannich479 commented 1 year ago

Hi Guillaume, I am playing around with minimal (not yet) working examples of your new Bifrost-v1.2.0 release. I am trying to use the new ColoredCDBG<>::read() function as introduced on Jun 11, 2022 (with 96761d9392ca72b798999c0d3663370caee380c8) to read a graph from gfa, bfi and color.bfg files.

Problem: My small example fails due to no FASTA/FASTQ being present.

    ColoredCDBG<> ccdbg;
    bool build_status = false;

    build_status  = ccdbg.read(opt.filename_graph_in, opt.filename_index_in, opt.filename_colors_in, opt.nb_threads, opt.verbose);
    build_status &= ccdbg.buildGraph(opt);
    build_status &= ccdbg.buildColors(opt);

    if (!build_status){
        std::cout << "Error building the graph" << std::endl;
        return 1;
    }

Error log:

ColoredCDBG::read(): Reading graph.

CompactedDBG::read(): Reading graph from disk

CompactedDBG::read(): Finished reading graph from disk
ColoredCDBG::read(): Reading colors.

DataStorage::read(): Reading color sets from disk
ColoredCDBG::loadColors(): Joining unitigs to their color sets.
CompactedDBG::build(): Number of FASTA/FASTQ files in input cannot be 0.
ColoredCDBG::buildColors(): Graph is invalid (maybe not built yet?) and colors cannot be mapped.
Error building the graph

Expected behavior: When using graph files (gfa, bfi, color.bfg), I shouldn't need any FASTA/FASTQ.

Proposal: I traced this case down to the function CompactedDBG<>::build() (overwritten in class ColoredCDBG). Maybe the condition in the snippet below is out of date for Bifrost's colored dBG? https://github.com/pmelsted/bifrost/blob/c15674193efeeef395ae5ff9b02a3c4805e02f96/src/CompactedDBG.tcc#L553

GuillaumeHolley commented 1 year ago

Hi @Krannich479,

I appreciate the detailed description of the issue, I'll look into it asap. I'm currently on my way back to Iceland, hopefully I can push a fix for this on Friday at the earliest ;)

I'll keep the issue open in the meantime and will update as soon as the fix is pushed.

Thanks, Guillaume

GuillaumeHolley commented 1 year ago

Hi @Krannich479,

I had a quick look into it and I think this is actually the expected behavior no matter what version of Bifrost you use. What you want to do here is to read the graph and its colors, which is simply achieved by the first call to ColoredCDBG::read(), after that the graph is fully loaded in memory with its colors and ready to use. Subsequent calls to buildGraph() and buildColors() will clear the graph you just read and try to build a new graph with colors out of input FASTA/FASTQ you do not have, hence the error message.

Guillaume

Krannich479 commented 1 year ago

Sorry for the confusion and thanks for clarification! Verified: I deleted the buildGraph() and buildColors() functions and the minimum working example works now of course.

GuillaumeHolley commented 1 year ago

Great :) Will close this issue but as always, don't hesitate to open another one if you have issues/questions!

GuillaumeHolley commented 1 year ago

If I may add one last thing: I changed the reading functions in ColoredCDBG such that a failed reading would put the graph in an invalid state (which apparently was not the case before).