hallamlab / MetaPathways

A modular pipeline for constructing Pathway/Genome Databases from environmental sequence information
http://hallam.microbiology.ubc.ca/MetaPathways

Gathering rRNA stats: correct number of rRNA sequences? #17

Closed by nielshanson 11 years ago

nielshanson commented 11 years ago

The rRNA stats.txt files in $output/results/rRNA/ seem to report far fewer rRNA sequences than the original .blastout file suggests. For example, wc counts 10 lines, i.e. 10 hits, in the greengenes.blastout file:

shebop:blast_results nielsh$ wc input1.rRNA.greengenes.blastout
      10     120     592 input1.rRNA.greengenes.blastout

However, the input1.greengenes.rRNA.stats.txt file reports that only two rRNA sequences were found:

Similarity cutoff : 20.0
Length cutoff : 180
Evalue cutoff : 1e-06
Bit score cutoff :  50.0
Number of rRNA sequences detected:  2

    GREENGENES_gg16S-2012-11-06         
    start   end similarity  evalue  bitscore    taxonomy
input1_54   94  1347    88.7    0.0 1520.0  k__Bacteria;p__SR1;
input1_28   1   1345    81.33   0.0 1042.0  k__Bacteria;p__GN02;c__BB34;

Could you do a sanity check that this is actually correct and that the scripts are not misreporting anything?

-Niels

nielshanson commented 11 years ago

The BLAST search is open-ended, with no cutoffs applied, whereas the .stats.txt file has the BLAST cutoffs applied, so it lists fewer sequences than the raw .blastout file.
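
To reproduce the count by hand, something like the following Python sketch could be used. It is not the MetaPathways code. It assumes the .blastout file is in the standard 12-column tabular BLAST layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), which is consistent with the 120 words over 10 lines that wc reports above. The cutoff values are copied from the stats.txt header. The function names, the inclusive comparisons, and the idea of counting distinct query sequences rather than raw hits are all assumptions made for illustration.

```python
from typing import Iterable, List

# Cutoffs copied from the stats.txt header quoted in this issue.
MIN_SIMILARITY = 20.0   # "Similarity cutoff"
MIN_LENGTH = 180        # "Length cutoff"
MAX_EVALUE = 1e-06      # "Evalue cutoff"
MIN_BITSCORE = 50.0     # "Bit score cutoff"


def filter_hits(lines: Iterable[str]) -> List[List[str]]:
    """Return rows of a 12-column tabular BLAST file that pass every cutoff.

    Hypothetical helper: whether the pipeline uses inclusive (>=, <=)
    or strict comparisons is an assumption here.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard tabular row
        similarity = float(fields[2])   # percent identity
        length = int(fields[3])         # alignment length
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if (similarity >= MIN_SIMILARITY
                and length >= MIN_LENGTH
                and evalue <= MAX_EVALUE
                and bitscore >= MIN_BITSCORE):
            kept.append(fields)
    return kept


def count_queries(rows: List[List[str]]) -> int:
    """Count distinct query sequences (column 1) among the given rows."""
    return len({row[0] for row in rows})


# Example usage (file name taken from the issue text):
#
#     with open("input1.rRNA.greengenes.blastout") as handle:
#         rows = filter_hits(handle)
#     print(len(rows), "hits pass;", count_queries(rows), "distinct queries")
```

If the number of distinct queries that pass matches the "Number of rRNA sequences detected" line, the difference from the wc line count is explained by the cutoffs and by multiple hits per query.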