jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

Segmentation fault in STEP 4: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS #54

Open thbtmntgn opened 5 years ago

thbtmntgn commented 5 years ago

I built a Kraken1 database and I'm trying to generate the corresponding bracken distribution file.

Everything went well until I got a segmentation fault at STEP 4 (CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS).

I got the following error message:

>>STEP 4: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
        <READ LENGTH>mers, with a database built using <KMER LENGTH>mers
        XXX sequences converted...bracken-build: line 155: 105733 Segmentation fault 

What could cause that issue?

I'm using Bracken version 2.2.

jenniferlu717 commented 5 years ago

I'm not 100% sure, honestly. Did you try running the program steps individually instead of using the shell script?

bsiranosian commented 5 years ago

Hi, I'm having this same issue on a particular database I'm making. I was able to narrow it down to the kmer2read_distr program. Running the build command on a single core makes it fail repeatedly at the same point, which suggests to me that a particular genome is to blame here.

@jenniferlu717 can you help me debug this? Maybe by printing out the name of the sequence the program is working on, so we can go in and eliminate it from the database.

       >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: /.../genbank_custom/taxonomy/nodes.dmp
                Seqid file:          /.../genbank_custom/seqid2taxid.map
                Num Threads:         1
                Kmer Length:         35
                Read Length:         100
        >>STEP 1: READING SEQID2TAXID MAP
                138769 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                2043483 total nodes read
        >>STEP 3: READING DATABASE.KRAKEN FILE
                138745 total sequences read
        >>STEP 4: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                100mers, with a database built using 35mers
                64959 sequences converted...Segmentation fault (core dumped)

Launched with the following command:

DATABASE=/.../genbank_custom
KMER_LEN=35
READ_LEN=150
THREADS=8
DIR=~/software/Bracken
$DIR/src/kmer2read_distr --seqid2taxid $DATABASE/seqid2taxid.map --taxonomy $DATABASE/taxonomy/ --kraken $DATABASE/database.kraken --output $DATABASE/database${READ_LEN}mers.kraken -k ${KMER_LEN} -l ${READ_LEN} -t ${THREADS}
python $DIR/src/generate_kmer_distribution.py -i $DATABASE/database${READ_LEN}mers.kraken -o $DATABASE/database${READ_LEN}mers.kmer_distrib
djcabral92 commented 5 years ago

Hello,

I am running into the same problem and am having trouble diagnosing it. I am running the program steps separately and I receive a segmentation fault error anytime I run the kmer2read_distr command. Does anyone have suggestions of how to fix it?

jenniferlu717 commented 5 years ago

Sorry to everyone! I haven't looked at this yet, but I will today. I hope to have an answer or fix in the next few hours.

jenniferlu717 commented 5 years ago

This may be more difficult than I hoped. It's not obvious to me what is causing the segmentation fault, especially since it does not happen when I run on the servers I'm testing with.

I suspect it occurs when the memory available is not enough for the number of genomes.

bpfeffer commented 4 years ago

I'm having the same problem with both versions 2.2 and 2.5. The server I'm doing the build on has 3TB of RAM (and 192 threads), and I'm not seeing spikes in memory use during step 4 before it fails.

One thing I tested was varying the number of threads, and the results were odd. The number of sequences converted in step 4 before the segfault was consistent for a given thread count (very little variance across runs with, say, 10 threads) but varied hugely across different thread counts. Examples using version 2.5 (step 3 reporting 510,068 total sequences, read length 100, kmer length 35):

30 threads: segfaults at ~100,000 sequences converted
19 threads: ~108,000
38 threads: ~218,000
39 threads: ~36,000
40 threads: ~180,000
50 threads: ~59,000
60 threads: ~73,000
70 threads: ~18,000
80 threads: ~240,000
90 threads: ~280,000
98 threads: ~37,000
99 threads: ~1,100
100 threads: 0 sequences converted
101 threads: ~133,000
102 threads: ~122,000

It goes up and down, with a few others reporting under 1000 sequences converted. It just seems weird there would be so much variance just for a multithreading variable. Any help would be appreciated.

bsiranosian commented 4 years ago

I know it's not much help, but I was able to eventually get the bracken database to build. I went back and reconstructed the kraken db (with the same genomes) and then bracken built without complaints.

kyclark commented 4 years ago

I think I am having the same errors. I was able to trace the error to kraken_processing.cpp, in the for loop of the "convert_distribution" function. The program would only run for a minute or so before segfaulting. When I commented out the #pragma omp parallel for line, it ran far beyond the point of failure I was seeing. I've resubmitted the job to my HPC to see if it will finish this time.

kyclark commented 4 years ago

Update: I edited kraken_processing.cpp to comment out all the #pragma omp lines. I recompiled and was able to get the program to run for about 5 hours before hitting the same segfault. Since it wasn't able to run the for loop in parallel, perhaps it just took longer to get to the same problem it had before?

Looking at the code, I can see extensive use of maps in read_kfile so that convert_distribution can randomly access data in the parallelized for loop. I'm not a C programmer, but this seems to me the crux of the issue. I don't get segfaults in the languages I use and I can't debug from a core file, but it would seem that improper use of memory is a constant problem in C and C++ and is a typical cause for segfaults.

I also see a lot of code that I really don't understand. For instance in read_kfile, a tab-delimited file is being read line-by-line. The code manually looks for the first tab character with line.find("\t"), then uses that value to look for the next and the next and so on. There's no code that checks that any tabs are ever found, just assumption after assumption that the previous code worked. Why not split the entire line on tabs and check the number of fields? Is there not a standard C++ library for reading tab-delimited files that has robust error checking in place?

Likewise in the convert_distribution function, the code reads through each of the kmer/count values and does this:

//Split up this pair into the taxid and the number of kmers
mid = curr_pair.find(":");
end = curr_pair.find("\n");

What if there is no colon? It took me a while to understand how the space-delimited field of kmer/counts was being split by std::stringstream, but I still don't understand how there could be a newline for any but the last kmer/count pair. If either of those find calls fails, how can you use the values in the following substr calls? I would think some sort of string split function that would break the colon-delimited pair in two. You should verify you have both parts before moving on.

If I understand the code correctly, my input file that this program is processing is about 8M in size. Is this a large file for this program? All in all, I think the program should not be written in C++. It's a language notoriously difficult to get right, especially concerning multi-threaded code. It might be safer (and just as slow) to write it in Python, where there are at least common libraries for handling tab-delimited data and splitting strings. For speed and accuracy, however, I've found Rust to be an amazing language. I don't mean to be disrespectful, but I can't see how to get this program to run as it is currently written, and I'm not at all convinced it could or should be written in C++.

novitch commented 4 years ago

Same issue for me... no solution? I'd really like to test your software on my data; I'm pretty sure it will decrease my false positive rate :)

Thanks, Alban.

jenniferlu717 commented 4 years ago

I made some changes that minimize the memory accesses, but I think I need to look at it further. Try the newest code and let me know?

It's hard for me to debug this because I cannot seem to recreate the error.

novitch commented 4 years ago

Hi, thanks. I tried another server where I can submit jobs and it seems to work well. It's not finished yet, but it's far past where I had the segmentation fault... fingers crossed.

elegodi commented 4 years ago

Hello all, I am running Bracken on a server and I get the same segmentation fault as other people here. Any advice on how to solve it? Is it because of the number of threads used? Thanks

novitch commented 4 years ago

Sorry, not on my side. I was lucky to run it successfully on another server with less RAM; I don't understand why it worked that time.

novitch commented 4 years ago

> I made some changes that can minimize the memory accesses but I think I need to look at it further. Try the newest code and let me know?
>
> Its hard for me to debug this because I cannot seem to recreate this error.

I tested on my problematic server with the new code but still have a segmentation fault :

        >>STEP 3: READING DATABASE.KRAKEN FILE
                625618150 sequences/home/matalb01/src/Bracken/bracken-build: line 166:  9586 Processus arrêté (process killed)      $DIR/src/kmer2read_distr --seqid2taxid $DATABASE/seqid2taxid.map --taxonomy $DATABASE/taxe${READ_LEN}mers.kraken -k ${KMER_LEN} -l ${READ_LEN} -t ${THREADS}
luhugerth commented 4 years ago

Has anyone managed to tweak kraken_processing.cpp for it to print out the taxon/sequence which causes the segfault? Or gotten any information at all from a log file?

My build crashes already on the sixth sequence with version 2.0 and on the 0th for version 2.5. It didn't crash when I only had RefSeq sequences, but started crashing once I added sequences from GenBank. I also expanded the RefSeq repertoire a bit, so I'm not sure if it's related to GenBank or not...

Since we're all working with custom databases, can it have something to do with the seqid2taxid.map files? What happens if Bracken tries to process a sequence and doesn't find it in the mapping file?

luhugerth commented 4 years ago

> Has anyone managed to tweak kraken_processing.cpp for it to print out the taxon/sequence which causes the segfault? Or gotten any information at all from a log file?

I have also tried building a DB only one genome at a time, to try to identify particular genomes that create a crash; but even when I found one, repeating the same exact operation in the same machine did not lead to any errors...

Debugging ideas, anyone?

ZhongyiZhu commented 4 years ago

I'm having the same problem: the more threads I use, the sooner I hit the segmentation fault, and the smaller the databaseXmers.kraken file it manages to write.

If I use 60 threads on a machine supporting 64 threads, it reports a segmentation fault immediately, with no databaseXmers.kraken file built.

My Bracken version is 2.5.3; as the posts above suggest, changing the software version may not help.

Any help?

ZhongyiZhu commented 4 years ago

Hi everyone, I finally built the Bracken database successfully. For me, the cause was using inconsistent Kraken versions between building the Kraken database and the Bracken database (e.g., building the Kraken DB with v2.0.9 but building Bracken with Kraken v2.0.7).

Thank you @bsiranosian, I did what you suggested and then realized the problem.

wee-gibbon commented 4 years ago

Hi guys,

I was having this problem and found that it seemed to be caused by something totally unrelated interrupting my first attempt to build the library, so the database.kraken file was created but perhaps never successfully completed. When rerunning bracken-build, I think it registers this faulty database.kraken and tries to use it anyway. I deleted all files created during that first (and the many subsequent) failed attempts and restarted the job from the same Kraken database, and it now seems to be running fine, well past the place where I usually received a segmentation fault error.

Hopefully this helps others too!

MatteoSchiavinato commented 3 years ago

Hi, I'm having the same issue. My kraken2 database was built with the same version that Bracken is trying to use (2.0.9-beta), and regardless of how much memory and how many threads I specify, it always crashes at the 4th step.

Any idea?

EDIT: Doing the separate steps as suggested in the Bracken manual worked. Not sure why. This was the problematic step:

python generate_kmer_distribution.py -i database${READ_LEN}mers.kraken -o database${READ_LEN}mers.kmer_distrib
andrewdavis3 commented 3 years ago

Running git clone of the repo worked for me. Pulling the source tarball, I got the Segmentation fault error.

brandoninvergo commented 2 years ago

I have been experiencing this too. Upon debugging, I can confirm that the segmentation fault occurs due to an attempt to access unmapped memory in a call to taxonomy::get_lvl_num() in function get_classification at line 247 in kraken_processing.cpp.

This is probably because n1 and n2 are not sanity-checked after they are initialized in lines 244 and 245 but before they're accessed in 247. In theory, the node might not be found. They need to be checked with if (!n1 || !n2), and something appropriate needs to be done in that case.

Unfortunately, as I only have my real data and no appropriate test data, it takes me about 7 hours to generate the segfault, so testing is a bit difficult. I'll implement the above check and just continue if it occurs. I do not know if that is the appropriate action, though (or, indeed, if the bug points to a bigger problem in the algorithm which this would just cover up). I'll report back when that is finished.

E.g.:

        //Break ties
        taxonomy * n1 = taxid2node->find(it->first)->second;
        taxonomy * n2 = taxid2node->find(max_taxid)->second;
        if (!n1 || !n2)
            continue;
        //Get to the same level
        while (n1->get_lvl_num() > n2->get_lvl_num()) {
            if (n1->get_parent() == NULL){
                cerr << n1->get_taxid() << endl;
            }
            n1 = n1->get_parent();
        }
        while (n1->get_lvl_num() < n2->get_lvl_num()) {
            if (n2->get_parent() == NULL){
                cerr << n2->get_taxid() << endl;
            }
            n2 = n2->get_parent();
        }
brandoninvergo commented 2 years ago

Putting the above checks in place allows kmer2read_distr to complete; however, it then passes the problem on to generate_kmer_distribution.py. In particular, where the segfault would have happened (I think), the database*mers.kraken file has a line that ends in [taxid]: (e.g. 575614: with no kmer count). This breaks line 79 in generate_kmer_distribution.py, because no checks are in place for an error that produces missing kmer counts.

This happens for me in the human database for the sequence "kraken:taxid|9606|NC_000013.11".

I am still debugging and will update this when I've found more.

brandoninvergo commented 2 years ago

Sorry, my earlier supposition was wrong. The bug is probably here:

        while (n1->get_lvl_num() > n2->get_lvl_num()) {
            if (n1->get_parent() == NULL){
                cerr << n1->get_taxid() << endl;
            }
            n1 = n1->get_parent();
        }

If n1->get_parent() == NULL, then the loop should break, otherwise n1->get_lvl_num() will segfault. The same goes for the second loop.

Since this only happens at taxid = 1, i.e. the root node, then the LCA condition after the loops needs to be updated too, so:

        while (n1->get_lvl_num() > n2->get_lvl_num()) {
            if (n1->get_parent() == NULL){
                break;
            }
            n1 = n1->get_parent();
        }
        while (n1->get_lvl_num() < n2->get_lvl_num()) {
            if (n2->get_parent() == NULL){
                break;
            }
            n2 = n2->get_parent();
        }
        //Find LCA
        while (n1 != n2) {
            if (n1->get_parent())
                n1 = n1->get_parent();
            if (n2->get_parent())
                n2 = n2->get_parent();
        }
brandoninvergo commented 2 years ago

Sorry for having gone the route of "reporting on the fly". I have now to all appearances fixed this bug. I will try to put together a pull request later.

After digging into the behavior a bit, I found that (referring to the code I have previously included), in some cases nodes n1 or n2 can be set to an invalid node with regards to the full taxonomy.

That is, look at this line: taxonomy * n1 = taxid2node->find(it->first)->second; There are taxonomic IDs that do not map to nodes. So taxid2node->find(taxid)->second for certain values of taxid does not return a valid node. In these cases, calling n1->get_taxid() returns -1 and n1->get_parent() returns NULL because the node is not attached to the taxonomy. In the aforementioned while loops, n1 = n1->get_parent() is called without checking for NULL, so the next iteration n1->get_lvl_num() is called on a null pointer, producing the segfault.

In my case, I noted that the two taxonomic IDs that caused the problem are 5092 and 1182571. The former has been migrated to a new ID and the latter doesn't exist. I don't know where to look to figure out why these IDs are even being referenced. But at least we can prevent them from breaking the database creation.

/***************************************************************************************/
/*METHOD: Process the array of the kmer distributions*/
int get_classification(vector<int> *curr_kmers, const taxonomy *my_taxonomy, const map<int, taxonomy *>\
 *taxid2node) {
/*...skipping ahead to line 234...*/
        //Find the maximum score                                                                        
        int max_score = 0;
        int max_taxid = 0;
        for (map<int, int>::iterator it=taxid2scores.begin(); it!=taxid2scores.end(); ++it){
            taxonomy * n1 = taxid2node->find(it->first)->second;
            if (n1->get_taxid() == -1)
                continue;
            if (it->second > max_score){
                //Get max scoring                                                                       
                max_score = it->second;
                max_taxid = it->first;
            } else if (it->second == max_score){
                //Break ties                                                                            
                taxonomy * n2 = taxid2node->find(max_taxid)->second;
                //Get to the same level                                                                 
                while (n1->get_lvl_num() > n2->get_lvl_num()) {
                    if (n1->get_parent() == NULL){
                      break;
                    }
                    n1 = n1->get_parent();
                }
                while (n1->get_lvl_num() < n2->get_lvl_num()) {
                    if (n2->get_parent() == NULL){
                        break;
                    }
                    n2 = n2->get_parent();
                }
                //Find LCA                                                                              
                while (n1 != n2) {
                  if (n1->get_parent())
                    n1 = n1->get_parent();
                  if (n2->get_parent())
                    n2 = n2->get_parent();
                }
                max_taxid = n1->get_taxid();
            }
        }
        return max_taxid;
    }
}

So, we check right away if iterator it refers to an invalid taxonomic ID and move on to the next value if it does. I retain all of the safety checks that I have previously discussed.

As mentioned, I'll try to put together a patch and pull request when I get a chance.

conmeehan commented 1 year ago

I just want to say this is still an issue, and it is indeed fixed with the taxonomy check in kraken_processing.cpp. It means that Bracken cannot be used from Conda as-is; it requires compiling from source. It would be great if these lines of code were added to the Conda version so it works seamlessly.

theo-allnutt-bioinformatics commented 1 year ago

This error is still occurring. Is the software now unsupported? I am using the latest versions of both Bracken and Kraken2 for the DBs.

    696 sequences converted (finished: NC_019274.1)/g/data/nm31/bin/Bracken-2.8/bracken-build: line 170: 1175372 Segmentation fault      $DIR/src/kmer2read_distr --seqid2taxid $DATABASE/seqid2taxid.map --taxonomy $DATABASE/taxonomy/ --kraken $DATABASE/database.kraken --output $DATABASE/database${READ_LEN}mers.kraken -k ${KMER_LEN} -l ${READ_LEN} -t ${THREADS}
conmeehan commented 1 year ago

Did you rebuild the kraken_processing.cpp as outlined above? That fixed it for me.

jenniferlu717 commented 3 months ago

@theo-allnutt-bioinformatics I am behind on fixing bugs for this software but it is by no means unsupported. Pull requests help the most for me if others are able to identify the issues.

This general error was introduced when I switched to C++. I believe I originally had this running in Python, but that takes a significantly longer amount of time, hence the switch.

@brandoninvergo thank you so much for your thorough investigation. I'm doing my best to fix the bugs.

emilyvansyoc commented 3 months ago

Hello, I'm running into the same problem using both a micromamba install and a git clone install. I am running this on the kraken2 fungal database. After installing with git clone and running the install script, trying to run 'bracken-build' gives this error: Bracken/bracken-build: line 229: syntax error: unexpected end of file (let me know if you'd like me to open that as a separate issue)

When I broke it down and ran the individual steps as specified in the manual, I hit the same segmentation fault error as described above. This happens with both the conda install and the git clone install, and does not vary with the memory or threads requested for the submitted job on a SLURM system.

I'm having trouble interpreting all of the debugging happening above. Is there a quick fix for this and a chunk of code that I can manually insert into the kmer2read_distr script based on a specific comment in the above thread?