pmelsted / bifrost

Bifrost: Highly parallel construction and indexing of colored and compacted de Bruijn graphs
BSD 2-Clause "Simplified" License

Tutorial on k-mer color API, my current use results in corruption? #34

Closed lrvdijk closed 3 years ago

lrvdijk commented 4 years ago

Hi,

Do you have any resources on how to use the k-mer/unitig color API in Bifrost? I have been playing around with it, and I think I understand it, but I'm encountering an issue where some unitigs have no colors associated with them anymore, or worse, the whole colorset is a nullptr.

For context: say I have a graph constructed from both a reference genome and WGS data from a different strain. I want to perform some graph cleaning, and identified a bunch of unitigs that have too low coverage in the sample and which I want to have removed, or at least not associated with the sample color anymore.

I've constructed the following example to do that: https://github.com/broadinstitute/pyfrost/blob/master/tests/test_node_removal.cpp

This example reads a file to_remove.txt which contains the head k-mer of a unitig to be removed from the sample on each line. First, I discard the sample color ID from that unitig, and if no colors remain, I queue it to be fully removed from the graph.
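
The two-step procedure described above can be sketched in isolation. This is a minimal model that uses only standard containers, not the Bifrost API: `ColorTable` and `clean_sample_color` are hypothetical names introduced for illustration, and a unitig is represented only by its head k-mer.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Stand-in for the color annotation: head k-mer of a unitig -> set of color IDs.
// This is NOT a Bifrost type; it only models the bookkeeping.
using ColorTable = std::map<std::string, std::set<std::size_t>>;

// Drops `sample_color` from every unitig listed in `to_remove`, then erases the
// unitigs that are left without any color. Returns the head k-mers of the
// erased unitigs.
std::vector<std::string> clean_sample_color(ColorTable& table,
                                            const std::vector<std::string>& to_remove,
                                            std::size_t sample_color) {
    std::vector<std::string> queued;

    // Pass 1: discard the sample color and queue unitigs with no colors left.
    for (const std::string& head : to_remove) {
        auto it = table.find(head);
        if (it == table.end()) continue;  // k-mer not in the table: skip it
        const bool had_color = it->second.erase(sample_color) > 0;
        if (had_color && it->second.empty()) queued.push_back(head);
    }

    // Pass 2: erase queued unitigs only after all colors were updated, so that
    // no lookup in pass 1 touches an entry that was already deleted.
    for (const std::string& head : queued) table.erase(head);

    return queued;
}
```

In this model, no entry can end up with an empty color set after the call unless it was already empty before, which is the invariant the reloaded graph is reported to violate.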

I save the cleaned graph to a file, and then read it again. Most nodes still have correct colors associated with them. For some nodes, however, the colorset will be a nullptr, resulting in a crash when trying to do any operation, while for others the colorset is not a nullptr but doesn't contain any colors (which shouldn't happen because those unitigs should've been removed).

Am I using the API in an incorrect way? Is it a custom function I added to Bifrost in my fork that transforms any UnitigMapping to a mapping representing the whole unitig? A bug in Bifrost?

Any help would be much appreciated, thanks!

GuillaumeHolley commented 4 years ago

Hey @lrvdijk ,

Apart from the documentation itself, I do not have any resources on how to use the k-mer/unitig color API in Bifrost. I am in the process of writing a tutorial for the Bifrost API, but there is quite some ground to cover and I have been busy with other projects recently. I will try to find the time for it soon.

I had a quick look at your code and it looks fine to me: I see no incorrect usage of the API. In general, I think it is good practice after calling find() to check whether the returned UnitigMap is empty (even if you know the k-mer is there), but otherwise I don't see any mistake. Could you test right before line 22 and right after line 51 whether you have unitigs associated with nullptr color sets? It would help me determine whether the issue comes from deleting colors, deleting unitigs, incorrect writing to file, or something else.
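
A check of the kind requested here could look like the sketch below. It is a simplified model rather than Bifrost code: `UnitigRecord`, `ColorReport` and `audit_colors` are hypothetical names, and a missing color set is represented by a null `std::unique_ptr`. Running such an audit after each stage (loading, color removal, unitig removal, write and reload) would show at which step the counts first become non-zero.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one unitig and its (possibly missing) color set.
struct UnitigRecord {
    std::string head_kmer;
    std::unique_ptr<std::set<std::size_t>> colors;  // nullptr models a missing color set
};

struct ColorReport {
    std::size_t null_sets = 0;   // color set pointer is nullptr
    std::size_t empty_sets = 0;  // color set exists but holds no colors
};

// Scans all unitigs and counts the two failure modes separately. The pointer
// is checked before it is dereferenced, so the audit itself cannot crash.
ColorReport audit_colors(const std::vector<UnitigRecord>& unitigs) {
    ColorReport report;
    for (const UnitigRecord& u : unitigs) {
        if (!u.colors) {
            ++report.null_sets;
        } else if (u.colors->empty()) {
            ++report.empty_sets;
        }
    }
    return report;
}
```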

lrvdijk commented 4 years ago

Thanks for your response!

I've added some extra tests, and it seems that the colors become incorrect after removing unitigs. The test consistently fails at the following line: https://github.com/broadinstitute/pyfrost/blob/master/tests/test_node_removal.cpp#L75

If you would like to try for yourself, the data is available here: https://github.com/broadinstitute/pyfrost/tree/master/tests/data

GuillaumeHolley commented 4 years ago

Hey @lrvdijk,

I am looking into your issue at the moment. In the meantime, I noticed the following on lines 14, 84 and 87 of your test program: you are passing arguments of type char* to the read() and write() functions, while they take string& as input. The compiler lets it through, but it is going to be an issue one day or another.
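
One plausible reading of this remark, sketched with a hypothetical function rather than the real read()/write() signatures: when a parameter is a const reference to std::string, a C string argument compiles because a temporary std::string is built implicitly at the call site. The names `output_name`, `from_c_string` and `from_checked`, and the ".gfa" suffix, are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical function with the parameter shape discussed above:
// it takes a const reference to std::string.
std::string output_name(const std::string& prefix) {
    return prefix + ".gfa";
}

// Passing a C string compiles because a temporary std::string is constructed
// implicitly at the call site. This only works for const references, and it
// is undefined behavior if `raw` is a null pointer.
std::string from_c_string(const char* raw) {
    return output_name(raw);
}

// Building the std::string explicitly makes the conversion visible and gives
// one place to validate the input first.
std::string from_checked(const char* raw) {
    const std::string prefix = (raw != nullptr) ? std::string(raw) : std::string();
    return output_name(prefix);
}
```

The explicit version costs nothing extra at run time, since the implicit version allocates the same temporary.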

GuillaumeHolley commented 4 years ago

Hey @lrvdijk,

I think there is a problem with your color file to begin with, and that is why you get weird behavior downstream in your test program. With multiple threads, I could read it but not join the color sets to their respective unitigs. With a single thread, I don't even get past the color reading:

I failed to find one of the right cookies. Found 142610454
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed alloc while reading
Aborted (core dumped)

The error message is a little bit cryptic but I know it is a CRoaring error, indicating there is something wrong with the file.
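
For background on the "cookie" wording: a serialized Roaring bitmap starts with a 32-bit magic value, and the reader rejects the input when that value is not one of the two expected cookies. The sketch below restates that check as I understand it from the Roaring serialization format; the constant values come from that format, while `is_valid_roaring_cookie` is a hypothetical helper and not a CRoaring function. It says nothing about how Bifrost lays out the rest of its color file.

```cpp
#include <cstdint>

// Magic values defined by the Roaring serialization format.
constexpr std::uint32_t kSerialCookieNoRunContainer = 12346;
constexpr std::uint32_t kSerialCookie = 12347;

// Returns true if `first_word` is an acceptable start of a serialized bitmap.
// The run-container cookie occupies only the low 16 bits (the high 16 bits
// carry the container count minus one); the other cookie must match the
// whole 32-bit word.
constexpr bool is_valid_roaring_cookie(std::uint32_t first_word) {
    return (first_word & 0xFFFFu) == kSerialCookie ||
           first_word == kSerialCookieNoRunContainer;
}

// The value reported in the error message above matches neither cookie.
static_assert(!is_valid_roaring_cookie(142610454u),
              "142610454 is not a valid Roaring cookie");
```

A value like this is what a reader would see if it started parsing a bitmap at the wrong offset or if the bytes on disk were not a bitmap header at all, which is consistent with a malformed file.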

To investigate the issue further, I will need the raw files that were used to create the graph and its color sets. Could you tell me when the color file was created, on which OS, and with how many threads?

GuillaumeHolley commented 4 years ago

Hi @lrvdijk,

Any chance I could have a look at those files? Thanks.

Guillaume

lrvdijk commented 4 years ago

Excuse me for the delay, here's a dropbox folder with the original data: https://www.dropbox.com/sh/xp0ehfhgzgynj94/AADEgixLc1XkqtT-TIZXcTUpa?dl=0

GuillaumeHolley commented 4 years ago

Thanks :)

GuillaumeHolley commented 4 years ago

Quick question @lrvdijk: did this issue occur on Linux or MacOS (or both) for you?

lrvdijk commented 4 years ago

I have only tested cleaning on MacOS.

Lucas van Dijk

On Tue, 18 Aug 2020 at 16:13, Guillaume Holley wrote:

> Quick question @lrvdijk: did this issue occur on Linux or MacOS (or both) for you?

lrvdijk commented 4 years ago

I've tested it now on Linux with GCC too, but the same problem happens.

Bifrost build -r Mycobacterium_tuberculosis_H37Rv.fasta -c -o ref_graph
Bifrost update -g ref_graph.gfa -f ref_graph.bfg_colors -s F11-frags.concat.fq.gz -d -i -o F11-frags

GuillaumeHolley commented 4 years ago

Hey @lrvdijk,

Thanks for letting me know. I have been able to reproduce the issue on my side (on Linux) and I am working on it, but unfortunately I don't have a solution yet. Here is what I've got:

It is not much but it is work in progress.

lrvdijk commented 4 years ago

Ah glad to hear you can reproduce it on your end!

Thanks a lot for all your work on this project and let me know if I can be of help.

I'll double check, but if I recall correctly I also encountered this problem when merging with a single thread.

GuillaumeHolley commented 3 years ago

Hi @lrvdijk,

I believe I have found the nasty bug and I pushed a bugfix for it. Unfortunately, any of the colored graphs produced over the last 2 months with Bifrost might have been affected. Sorry about that. Let me know as soon as possible if the bugfix fixes your problem.

lrvdijk commented 3 years ago

Amazing work!! I've successfully run all my scripts without errors. Thanks a lot!

GuillaumeHolley commented 3 years ago

Awesome!