hzi-bifo / RiboDetector

Accurate and rapid RiboRNA sequences Detector based on deep learning
GNU General Public License v3.0
94 stars 16 forks source link

False postivies when using on metagenomic data? #21

Closed rezenman closed 1 year ago

rezenman commented 1 year ago

Hey ! Thanks for the efforts made developing this useful tool. I am currently using it to seperate rRNA from Metagenomic data derived from whole genome sequencing of mock population constructed from 12 known species. After extraction, I am aligning the rRNA reads to the Silva most recent database, expecting a high percentage of reads to align. In order to do that I am using bbseal (which uses kmers), in a reltively loose way allowing up to 2bp hamming distance. However, I am getting that only 70% of the reads identified as rRNA are aligned to the Silva database. Can you help me understand why ? This is the script used:

ribodetector_cpu -t 8 \ -l 151 \ -i $read1 $read2 \ -e norrna \ --chunk_size 256 \ -o "$folder"_reads.nonrrna.1.fq "$folder"_reads.nonrrna.2.fq \ -r "$folder"_reads_rRNA.1.fq "$folder"_reads_rRNA.2.fq

Thanks a lot, Shahar

dawnmy commented 1 year ago

Hi Shahar,

I assume that bbseal search with hamming distance <=2bp is too strict for rRNA because Silva does not contain rRNAs from all species. There are many unknown species in metagenome data for which only related species are available in Silva. They will be likely missed by bbseal. You can try https://www.arb-silva.de/aligner/, RDP, or blastn which may give better classification. Be aware that you are using all the rRNA types as reference (all SSU and LSU). Hope this helps

dawnmy commented 1 year ago

BTW, you can set -e rrna which will have better precision for rRNA

rezenman commented 1 year ago

Hey, thanks for the quick answer I understand what you're saying about the missing species, however, since it is a mock population sample constructed only from known species I assume there aren't many unknown species there. I'll definitely try and use the tool with the-e rrna flag instead BTW, I tried also with the -e none flag and I get fewer rRNA reads and as a result higher percentage of reads which do align to the silva database, so it might be the solution in my case, any thoughts? Thanks, Shahar

dawnmy commented 1 year ago

Yes, the number of reads for rRNA output is like: -e norrna >= -e none >= -e rrna. i.e. for rRNA prediciton, norrna is the most loose option, and the most strict is rrna.

rezenman commented 1 year ago

Great, thanks a lot ! And one last question, do you have any recommendation on how to differntiate between LSU and SSU? I am only interested in the 16S sequence and therefor only the SSU is relevant for me

dawnmy commented 1 year ago

I would blast the output rRNA reads against the 16rRNA sequences from https://rrndb.umms.med.umich.edu/static/download/ or https://greengenes.secondgenome.com/?prefix=downloads/greengenes_database/gg_12_10/ to retrieve the 16S rRNA reads.