jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
290 stars 50 forks source link

Segmentation fault in step 1 (seqid2taxid parsing) #135

Open larssnip opened 3 years ago

larssnip commented 3 years ago

I have made a kraken2 database, and it works fine. However, when trying to build a bracken database from this I get a segmentation fault, and I believe it occurs in this line of kmer2reads_distr.cpp :

//Save in map seqid2taxid->insert(std::pair<string,int>(curr_seqid, curr_taxid));

(in void get_seqid2taxid()). I guess there is some attempt to add outside the capacity of the seqid2taxid data structure. The thing is my database contains 29 000 genomes, who are highly fragmented, i.e. many contigs. This makes the number of header-lines huge. The standard kraken2 bacteria library is bigger in terms of amount of sequence, but has only around 27000 headerlines (complete RefSeq genomes). My genomes have 9.5 million headerlines. Is this the problem here? I allocate 100GB for this job, and have tried 500GB without any effect, still crashes. Anyone seen this before?

larssnip commented 3 years ago

I sorted out this error, but leave a comment here for other people who might experience similar, and an idea for Jennifer how to easily avoid future user to spend the days (and nights) I did to sort this out…

The problem was in the taxonomy, the nodes.dmp file. This was created by combining the GTDB and NCBI taxonomy. As a result of some old/bad mapping between them downloaded from GTDB, this resulted in the following: The first column of nodes.dmp lists all tax_id values. The second column lists all parent_tax_id value. In a few cases, a value in the second column was not found in the first, i.e. a "parent" was listed that had no proper tax_id. When this was corrected, everything worked fine. But given the non-informative "segmentation fault" this was close to impossible to sort out.

Jennfer: In construct_taxonomy(), where the nodes.dmp is read, I suggest it would be nice to parse the nodes.dmp twice, and first check that all parent_tax_id's are indeed proper tax_id's i.e. the values in the second column are ALL found in the first column. Could also check that values in the first column (tax_id) are all unique (no repetitions), as this will also easily happen when customizing the taxonomy. This should be a small piece of code, and would take close to no extra time, but leave a proper error message to the user. I think that using a custom taxonomy is something more and more people will do (for good reasons).

YiyanYang0728 commented 1 month ago

I sorted out this error, but leave a comment here for other people who might experience similar, and an idea for Jennifer how to easily avoid future user to spend the days (and nights) I did to sort this out…

The problem was in the taxonomy, the nodes.dmp file. This was created by combining the GTDB and NCBI taxonomy. As a result of some old/bad mapping between them downloaded from GTDB, this resulted in the following: The first column of nodes.dmp lists all tax_id values. The second column lists all parent_tax_id value. In a few cases, a value in the second column was not found in the first, i.e. a "parent" was listed that had no proper tax_id. When this was corrected, everything worked fine. But given the non-informative "segmentation fault" this was close to impossible to sort out.

Jennfer: In construct_taxonomy(), where the nodes.dmp is read, I suggest it would be nice to parse the nodes.dmp twice, and first check that all parent_tax_id's are indeed proper tax_id's i.e. the values in the second column are ALL found in the first column. Could also check that values in the first column (tax_id) are all unique (no repetitions), as this will also easily happen when customizing the taxonomy. This should be a small piece of code, and would take close to no extra time, but leave a proper error message to the user. I think that using a custom taxonomy is something more and more people will do (for good reasons).

I also encountered this segmentation fault problem and have been debugging for half a day... I agreed with @larssnip and thanks for his suggestion. The error is very likely coming from the parent node is not found in a child node in "nodes.dmp" file. After I made a nodes.dmp file making sure all parent nodes can be found in the 1st column, bracken works very well. To fix this, you will need to provide a full path from a child node all the way up to the root node (ID=1) and add these lines to the nodes.dmp. That's my solution. Suppose you have a nodes.dmp like I have (I got it from https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp) covering almost all possible taxonomies paths, and you can extract the lines covering all the paths from the child node to the root by using the python script I provided here (a stupid one but works):

def read_nodes(file_path):
    """
    Reads the nodes.dmp file and returns a dictionary where the key is the child node
    and the value is the parent node.
    """
    nodes = {}
    lines = {}
    with open(file_path, 'r') as file:
        for line in file:
            cols = line.strip().split('\t')
            # print(cols)
            child, parent = cols[0], cols[2]
            nodes[child] = parent
            lines[child+"_"+parent] = line
    return nodes, lines

def get_path_to_root(child, nodes):
    """
    Given a child node and a dictionary of nodes, returns the path from the child node
    to the root, including all children and parents.
    """
    path = []
    current = child

    # Traverse up to the root
    while current in nodes:
        parent = nodes[current]
        path.append((current, parent))
        # if current == parent:  # Root node
        #     break
        if parent == "1":  # Root node
            break
        current = parent

    return path

def main(file_path, child_node):
    # Read nodes from the file
    nodes, lines = read_nodes(file_path)

    # Get the path to the root for the given child node
    path = get_path_to_root(child_node, nodes)
    # Print the path
    for child, parent in path:
        print(lines[child+"_"+parent])

# Usage
import sys
child_node = sys.argv[1] # Replace with the child node you want to query
file_path = sys.argv[2] # Replace with the actual path to your nodes.dmp file
main(file_path, child_node)

Usage: python <this python script> <child_Id or parent_Id> <a big nodes.dmp file covering all taxon paths> Hope this is useful for you.