pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

unitigs with no colors #73

Open hangsuUNC opened 11 months ago

hangsuUNC commented 11 months ago

Hi,

Thanks for this wonderful tool! I'm writing to ask a question about the colors of unitigs. I found that about 3% of the unitigs created by Bifrost have no colors. Is this because these unitigs do not exist in any of the samples (random recombinations of a set of k-mers)? I used pyfrost to load the graph and the color files for the analysis.

Here is the command:

Bifrost build -t 16 -k 31 -c -s -r -o

Pyfrost:

g = pyfrost.load()
nodelist = list(g.nodes)
no_colors = 0
for node in nodelist:
    try:
        colors = g.nodes[node]['colors']
    except:
        no_colors += 1

Results:

image

Thanks for your help!

Best,

Hang

GuillaumeHolley commented 11 months ago

Dear @hangsuUNC,

In color mode, each k-mer is guaranteed to get at least one color, so if you have k-mers or unitigs without colors, it is a bug in either Bifrost or Pyfrost (which is developed independently from Bifrost). I am not familiar with the Pyfrost syntax, but I had a quick look at the Pyfrost README and saw that the syntax to iterate over pairs of k-mers and colors is:

for n, data in g.nodes(data=True):
    for c in data['colors']:
        print("Node", n, "has color", c)

From your code, it seems that data=True is not used and that you access colors per unitig rather than per k-mer, which I am not sure is possible. Could you look into that first?

Thanks, Guillaume
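
A minimal sketch of the check discussed above, written against plain (node, data) pairs so it runs without Pyfrost. It assumes, following the README snippet, that g.nodes(data=True) yields such pairs, that each data object supports `in` like a dict, and that data['colors'] is iterable. The helper name count_uncolored and the sample data are hypothetical. Unlike a bare except, it separates a missing 'colors' attribute from an empty color set, which may help tell a lookup problem apart from a genuinely uncolored node.

```python
def count_uncolored(node_data_pairs):
    """Count nodes whose 'colors' attribute is missing or empty.

    Returns a tuple (missing, empty, total).
    """
    missing = empty = total = 0
    for _node, data in node_data_pairs:
        total += 1
        if 'colors' not in data:
            # The attribute is absent altogether.
            missing += 1
        elif not any(True for _ in data['colors']):
            # The attribute exists but iterating it yields nothing.
            empty += 1
    return missing, empty, total


# Hypothetical stand-in for g.nodes(data=True); not real graph data.
sample = [
    ("ACGT", {"colors": [0, 1]}),
    ("CGTA", {"colors": []}),
    ("GTAC", {}),
]
print(count_uncolored(sample))
```

With a loaded graph, the assumed call would be count_uncolored(g.nodes(data=True)).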

hangsuUNC commented 11 months ago

Hi Guillaume,

Thanks for your reply! I tried:

nodes_info = []
for n, data in g.nodes(data=True):
    num = 0
    for c in data['colors']:
        num += 1
    nodes_info.append(["Node", n, "has color", c])

Got an error message:

image

Not sure what that means... I will contact the pyfrost author later!

Thank you!

Hang

GuillaumeHolley commented 11 months ago

Unfortunately, I don't know why this fails. I think that contacting the Pyfrost author first is a good idea. Make sure you link this issue in the one you are going to create, and I'll assist with any Bifrost-related test or bug.

hangsuUNC commented 11 months ago

Thank you so much, Guillaume! Will do!

Best,

Hang

GuillaumeHolley commented 10 months ago

Hi @hangsuUNC,

I'll close this for now since it is unclear at the moment if the issue is with Pyfrost or Bifrost. Don't hesitate to reopen or link to this issue if there is some progress on the matter.

Guillaume

hangsuUNC commented 8 months ago

Hi Guillaume,

Sorry for the delayed response! I contacted the pyfrost author @lrvdijk and examined different versions of Bifrost and their output graphs. Lucas created a test C++ program that uses the Bifrost API directly (so not the pyfrost Python library), and it still fails with k-mers that have no colors. It looks like a Bifrost color matrix issue rather than a Python library issue.

Could you please help check the color matrix for the Bifrost graph? If you need any additional information for testing, please let us know!

Thanks a lot for your help!

Best,

Hang

lrvdijk commented 8 months ago

For reference, here's the C++ test program (using Catch2 test framework, but you get the idea):

#include <fstream>
#include <iostream>
#include <unordered_map>

#include <catch2/catch.hpp>
#include <ColoredCDBG.hpp>
#include <Kmer.hpp>

TEST_CASE("Test unitig color data", "[unitig_color_data]") {
    CCDBG_Build_opt opt;
    opt.filename_graph_in = "data/MT_graph_Bfrost_graph.gfa";
    opt.filename_colors_in = "data/MT_graph_Bfrost_graph.bfg_colors";

    // Load the graph and its color file (2 threads).
    ColoredCDBG<> ccdbg(opt.k, opt.g);
    ccdbg.read(opt.filename_graph_in, opt.filename_colors_in, 2);
    auto total_num_colors = ccdbg.getColorNames().size();

    // K-mers carrying every color are written here as "anchors".
    std::ofstream anchors;
    anchors.open("data/anchors.txt");

    for(auto const& um : ccdbg) {
        auto colorset = um.getData()->getUnitigColors(um);
        std::cout << "Testing colorset of " << um.getMappedHead().toString() << std::endl;
        // Every unitig is expected to have a color set.
        REQUIRE(colorset != nullptr);

        // Number of colors seen at each k-mer position of this unitig.
        std::unordered_map<size_t, size_t> colors_per_kmer{};

        for(auto it = colorset->begin(um); it != colorset->end(); ++it) {
            colors_per_kmer.emplace(it.getKmerPosition(), 0).first->second++;
        }

        for(auto const& p : colors_per_kmer) {
            if(p.second == total_num_colors) {
                anchors << um.getUnitigKmer(p.first).toString() << std::endl;
            }
        }
    }

    anchors.close();
}

The graph was created with Bifrost <1.2, and the test was run with the same pre-1.2 Bifrost version.

For many k-mers, the colorset pointer is fine, but for some it's not.

Screenshot 2023-11-08 at 3 12 55 PM

(When using the Python wrapper, the same k-mer fails too).
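
One thing to note about the test as written: colors_per_kmer is filled only from positions that the color iterator visits, so a k-mer with zero colors never appears in it, and only a null colorset trips the REQUIRE. Below is a minimal Python sketch of a stricter per-k-mer check on plain data, so it runs without Bifrost. The helper name kmers_by_color_count and the (kmer_position, color_id) pair layout are assumptions for illustration, not Bifrost API, and each pair is assumed to occur at most once.

```python
from collections import Counter


def kmers_by_color_count(num_kmers, position_color_pairs, total_num_colors):
    """Classify the k-mer positions of one unitig by how many colors they carry.

    num_kmers: number of k-mers in the unitig (unitig length - k + 1).
    position_color_pairs: iterable of distinct (kmer_position, color_id) pairs.
    Returns (uncolored_positions, anchor_positions).
    """
    counts = Counter(pos for pos, _color in position_color_pairs)
    # Walk every position, not just those seen in the pairs, so that
    # k-mers with zero colors are reported instead of silently skipped.
    uncolored = [p for p in range(num_kmers) if counts[p] == 0]
    anchors = [p for p in range(num_kmers) if counts[p] == total_num_colors]
    return uncolored, anchors


# Hypothetical unitig with 4 k-mers and 2 colors in the graph.
pairs = [(0, 0), (0, 1), (1, 0), (3, 1)]
uncolored, anchors = kmers_by_color_count(4, pairs, 2)
print(uncolored, anchors)
```

In this made-up example, position 2 has no color and position 0 carries both colors.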

hangsuUNC commented 8 months ago

bifrost_graphs.zip

I attached the bifrost graphs here for your reference!

Thanks in advance!

Hang

GuillaumeHolley commented 8 months ago

Hi @hangsuUNC,

I am reopening the issue. Would it be also possible for you to share the input data used to build the graph as well as the exact Bifrost version/commit used? Thanks!

Guillaume

hangsuUNC commented 8 months ago

Hi Guillaume,

Thanks for your reply! Here is the construction command:

Bifrost build -t ~{num_threads} -k ~{kmersize} -i -d -c -s ~{sep=" -s " fas} -r ~{ref} -o ~{outputpref}_Bfrost_graph

The Docker images I used are:

1) hangsuunc/bifrost:v1 (Bifrost 1.2.0)
2) us-central1-docker.pkg.dev/broad-dsp-lrma/fusilli/fusilli:devel (Bifrost 1.0.6.5)

All of the outputs have the same issue...

Here is the input file merged into a single fasta: all.fasta.gz

Thanks again for your help!

Best,

Hang