hzi-bifo / RiboDetector

Accurate and rapid RiboRNA sequences Detector based on deep learning
GNU General Public License v3.0

RiboDetector is identifying reads generated from ChrX and ChrY as rRNA reads #37

Closed rajanbit closed 1 year ago

rajanbit commented 1 year ago

Dear @dawnmy @alicemchardy, we are using RiboDetector to identify rRNA reads in human WGS data. We found that reads from ChrX, ChrY and Chr2, along with reads from other chromosomes, are being identified as rRNA reads, whereas human rRNA genes are present only on Chr1, Chr13, Chr14, Chr15, Chr21 and Chr22. The majority of reads are correctly identified, but a few are misclassified. Furthermore, when we generated synthetic reads (20X coverage) from the T2T-CHM13v2.0 human reference genome and used them as input, the number of misclassified reads increased many-fold. We have used both the GPU and CPU versions of the tool, and the results are the same.

Here is the command we are using:

$ ribodetector -t 8 -l 150 -i sample_R1.fastq.gz sample_R2.fastq.gz -e rrna --chunk_size 256 -o output_R1.fastq.gz output_R2.fastq.gz -r rRNA_R1.fastq.gz rRNA_R2.fastq.gz

Can you please suggest:

  1. The possible reason(s) for this misclassification.
  2. How can we increase its accuracy on our datasets?
  3. How can we train RiboDetector on our own datasets (gene sequences and WGS reads)?

andreiprodan commented 1 year ago

@rajanbit I was wondering what percentage of reads are falsely labelled as rRNA (as a % of the total human raw reads in the FASTQ)?

rajanbit commented 1 year ago

On a real WGS dataset: RiboDetector identified 0.053% of reads as rRNA reads, and 14% of those predicted rRNA reads mapped to chromosomes that are not expected to carry rRNA genes. To check whether this was due to mapping error, we simulated synthetic reads with the chromosome name and location info in the header.
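To put these two percentages on the same scale as "% of total raw reads", they can be multiplied together. The sketch below does that for the real WGS figures quoted above. The function name is made up for illustration, and it assumes every predicted rRNA read that maps outside the rRNA-bearing chromosomes is a true misclassification rather than a mapping error, which is exactly what the simulated dataset was meant to check.

```python
def misclassified_share_of_all_reads(predicted_rrna_pct, off_target_pct):
    """Return off-target rRNA predictions as a percentage of ALL input reads.

    predicted_rrna_pct: % of all reads that the tool labelled as rRNA.
    off_target_pct:     % of those labelled reads that map to chromosomes
                        not expected to carry rRNA genes.
    ASSUMPTION: every off-target read is a genuine misclassification.
    """
    return predicted_rrna_pct * off_target_pct / 100.0


# Figures quoted in this comment for the real WGS dataset.
real_wgs = misclassified_share_of_all_reads(0.053, 14.0)
print(f"{real_wgs:.4f}% of all reads")  # prints "0.0074% of all reads"
```

Under that assumption, roughly 0.0074% of all reads in the real dataset would be falsely labelled as rRNA.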

On a simulated dataset using T2T-CHM13v2.0 (read length 150 bp, 20x coverage, zero base errors and zero SNPs/indels): RiboDetector identified 0.066% of reads as rRNA reads, and in this case 34% of the predicted rRNA reads came from chromosomes that are not expected to carry rRNA genes. As we had the chromosome name for each read, we confirmed that reads originating from chr2, chrX, chrY, etc. are being predicted as rRNA reads.
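A check like the one described above could be sketched as follows: read the chromosome of origin from each header in the predicted-rRNA FASTQ and count how many reads fall outside the chromosomes listed in this issue as carrying rRNA genes. The header layout (`@chrX_500_650/1`, chromosome name before the first underscore) is an assumption for illustration, since the actual simulator and its naming scheme are not given in the thread; all function names are hypothetical.

```python
import gzip
import io
from collections import Counter

# Chromosomes named in this issue as carrying human rRNA genes.
RRNA_CHROMS = {"chr1", "chr13", "chr14", "chr15", "chr21", "chr22"}


def chrom_from_header(header):
    # ASSUMPTION: simulated read names look like "@chr2_900_1050/1",
    # i.e. the chromosome name is the text before the first underscore.
    return header.lstrip("@").split()[0].split("_")[0]


def tally_chromosomes(fastq_handle):
    """Count reads per chromosome of origin from a FASTQ text stream."""
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0:  # every 4th line, starting at 0, is a header
            counts[chrom_from_header(line.strip())] += 1
    return counts


def off_target_fraction(counts):
    """Fraction of reads whose chromosome is not in RRNA_CHROMS."""
    total = sum(counts.values())
    off = sum(n for chrom, n in counts.items() if chrom not in RRNA_CHROMS)
    return off / total if total else 0.0


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


# Tiny in-memory example standing in for a predicted-rRNA FASTQ file.
demo = io.StringIO(
    "@chr13_100_250/1\nACGT\n+\nIIII\n"
    "@chrX_500_650/1\nACGT\n+\nIIII\n"
    "@chr2_900_1050/1\nACGT\n+\nIIII\n"
    "@chr21_42_192/1\nACGT\n+\nIIII\n"
)
demo_counts = tally_chromosomes(demo)
demo_fraction = off_target_fraction(demo_counts)
print(dict(demo_counts), demo_fraction)
```

For a real run, the stream could come from `open_fastq("rRNA_R1.fastq.gz")`, the rRNA output file named in the command above. The parser would need adjusting to whatever header format the read simulator actually writes.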

dawnmy commented 1 year ago

Thank you for taking the time to test RiboDetector. Your feedback is invaluable to us. In addition to your test, I ran an experiment using simulated PE 150 reads from human ChrX and found that approximately 0.04% of the reads were misclassified as rRNA. This misclassification rate is already quite low, although it can be further improved by including the T2T genome in the training dataset.

The training dataset for our model used the hg38 version of the human genome rather than the T2T version. As a result, there may be some previously unseen human sequences (features) in the T2T genome that could be misclassified. This is an aspect we intend to refine in future updates.

In general, these misclassified reads do not tend to be enriched in any particular functional groups. Consequently, in the majority of cases, such misclassifications should not impact the downstream functional analysis significantly.

rajanbit commented 1 year ago

Dear @dawnmy, thank you for your response. I am closing this issue.