jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

seqid2taxid can't find seqid #258

Closed eric9n closed 7 months ago

eric9n commented 7 months ago

https://github.com/jenniferlu717/Bracken/blob/8800e4bfd0c984fec46d3f8bf9fad2129d52d3e9/src/kraken_processing.cpp#L113

    //Find delimiter indices
    int pos1, pos2, pos3, pos4, pos5;
    pos1 = line.find("\t");
    pos2 = line.find("\t", pos1+1);
    pos3 = line.find("\t", pos2+1);
    pos4 = line.find("\t", pos3+1);
    pos5 = line.find("\n", pos4+1);
    //Extract seqid and taxid 
    seqid = line.substr(pos1+1, pos2-pos1-1);
    taxid = seqid2taxid->find(seqid)->second;
    string curr_ks = line.substr(pos4+1, pos5-pos4-1);

seqid2taxid->find(seqid)->second;

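This lookup is not guaranteed to find seqid. If the key is missing, find() returns end(), and dereferencing that iterator with ->second is undefined behavior.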

In Kraken, seqid2taxid maps NCBI sequence IDs to taxonomy IDs, so its keys are in NCBI format. The second column of the Kraken output, on the other hand, holds the sequence IDs taken from the input fastq file. The two only correspond one-to-one when the input sequences are named with NCBI sequence IDs that also appear in seqid2taxid. Any other ID has no entry in the map, and the lookup above fails.
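
One way to guard against the missing key is to check the iterator before using it. The following is only a minimal sketch, not the Bracken implementation: it assumes seqid2taxid is a std::map<std::string, int>, and the helper names extract_seqid and lookup_taxid are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not part of Bracken): copy the second tab-separated
// field of a Kraken output line into seqid. Returns false if the line has
// fewer than two tab delimiters.
bool extract_seqid(const std::string& line, std::string& seqid) {
    const std::size_t pos1 = line.find('\t');
    if (pos1 == std::string::npos) return false;
    const std::size_t pos2 = line.find('\t', pos1 + 1);
    if (pos2 == std::string::npos) return false;
    seqid = line.substr(pos1 + 1, pos2 - pos1 - 1);
    return true;
}

// Hypothetical helper: look up seqid without ever dereferencing end().
// Assumes the map type is std::map<std::string, int>.
// On success, writes the taxid and returns true; otherwise returns false
// and leaves taxid unchanged.
bool lookup_taxid(const std::map<std::string, int>& seqid2taxid,
                  const std::string& seqid, int& taxid) {
    const std::map<std::string, int>::const_iterator it = seqid2taxid.find(seqid);
    if (it == seqid2taxid.end()) return false;
    taxid = it->second;
    return true;
}
```

With a check like this, the caller can skip the line or report the unknown seqid when lookup_taxid returns false, instead of reading through an invalid iterator.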