linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0
130 stars 40 forks source link

Bug in ReadBedtoos function #166

Closed phingchian closed 2 months ago

phingchian commented 4 months ago
def ReadBedtoos(filename):
    lines = open(filename).readlines()
    seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}
    normalized_tpm = 0.
    for seqid in seqid2info:
        seqid_depth = seqid2info[seqid]
        normalized_tpm += seqid_depth.read_count/seqid_depth.length
    return seqid2info,normalized_tpm

The *.depth.txt generated has no header - slicing lines[1:] will omit the first entry. Suppose the first entry is a valid CAZyme, it will fail the check at get_length_readcount() and exit prematurely. Kindly check.

phingchian commented 4 months ago

Just to add, the *.depth.txt file is generated following the below example in the user guide:

seqkit fx2tab -l -n -i ../prokka_Wet2014/Wet2014.ffn | awk '{print $1"\t"$2}' > Wet2014.length
seqkit fx2tab -l -n -i ../prokka_Wet2014/Wet2014.ffn | awk '{print $1"\t"0"\t"$2}' > Wet2014.bed
bedtools coverage -g Wet2014.length -sorted -a Wet2014.bed -counts -b ../samfiles/Wet2014.CDS.bam > Wet2014.depth.txt
ZhengJinfang1220 commented 4 months ago
def ReadBedtoos(filename):
    lines = open(filename).readlines()
    seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}
    normalized_tpm = 0.
    for seqid in seqid2info:
        seqid_depth = seqid2info[seqid]
        normalized_tpm += seqid_depth.read_count/seqid_depth.length
    return seqid2info,normalized_tpm

The *.depth.txt generated has no header - slicing lines[1:] will omit the first entry. Suppose the first entry is a valid CAZyme, it will fail the check at get_length_readcount() and exit prematurely. Kindly check.

Thank you for identifying this bug. Regardless of whether the first line is a CAZyme or not, it should be counted in the total count of mapped reads.