jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
286 stars 50 forks source link

total reads in combine_bracken_outputs.py #265

Open maAr517 opened 4 months ago

maAr517 commented 4 months ago

I found output data of this python script exhibited the number of "added_reads" in bracken output file as a read number, probably because "estreads" is used as a total count in line 119 and 120, but is it correct? It seems that "areads" is appropriate.

I am a beginner on bioinformatics, so I appreciate it if anyone made any suggestion or instruction.

jenniferlu717 commented 3 months ago

Im not 100% understanding what you are confused by.

Bracken is adding more reads to the Kraken read counts but when combining the reports to compare them, we care most about the est_reads (estimated reads after combining) not just the added reads.

maAr517 commented 3 months ago

Thank you for replying.

I totally agree with you on the point that estimated reads after combining should be cared. And I'm so sorry, I found my concerning is not a common problem to other users. I don't know why, but the script in my PC was modified.

I concerned the definition of estreads. In my bracken output file (the product of "-o" option), columns are "name", "taxonomy_id", "taxonomy_lvl", "kraken_assigned_reads", "added_reads", "new_est_reads", and "fraction_total_reads" from the first column to the last column. But, in my "combine_bracken_outputs.py" (lines 105-122),

        #Process line 
        [name, taxid, taxlvl, kreads, estreads, areads, frac] = line.strip().split("\t")
        estreads = int(estreads) 
        #Error Checks
        if name not in sample_counts:
            sample_counts[name] = {}
            sample_counts[name][taxid] = {}
        elif taxid != list(sample_counts[name].keys())[0]:
            sys.exit("Taxonomy IDs not matching for species %s: (%s\t%s)" % (name, taxid, sample_counts[name].keys()[0]))
        if len(level) == 0:
            level = taxlvl 
        elif level != taxlvl:
            sys.exit("Taxonomy level not matching between samples")
        #Save counts
        total_reads[curr_name] += estreads
        sample_counts[name][taxid][curr_name] = estreads 
    #Close file 
    i_file.close()

So I thought that the term "aread" (I assumed "a" meant "all") was appropriate, but in https://github.com/jenniferlu717/Bracken/blob/master/analysis_scripts/combine_bracken_outputs.py, there is no problem.

        #Process line 
        [name, taxid, taxlvl, kreads, areads, estreads, frac] = line.strip().split("\t")
        estreads = int(estreads) 

I'm sorry again for consuming your time with such a insignificant issue and thank you again for developing such a useful tool!