hzi-bifo / RiboDetector

Accurate and rapid RiboRNA sequences Detector based on deep learning
GNU General Public License v3.0

No classification at all for some data and classification for some #49

Closed EorgeKit closed 6 months ago

EorgeKit commented 6 months ago

Dear @dawnmy @alicemchardy @foobarx @fernandomeyer @TRKlingen, many thanks for this wonderful tool. I have tried it for the first time and I am getting conflicting results. I have three metagenomics datasets: two are 16S amplicon sequencing data and one is shotgun data. After running ribodetector on all three datasets, I only get classified output for one of the 16S datasets (data2). The output files for the other 16S dataset (data1) are empty, and so are those for the shotgun dataset (data3), even though I ran the same code on all of them. I ran it on the 16S datasets as a benchmark, to see whether the misbehaviour was specific to the shotgun dataset or whether it might also fail to classify amplicon sequencing data.

Below is the script I am running. I have also provided data1, the second amplicon dataset, which produces no results. It's also worth noting that I was initially running with -e rrna, then changed to norrna, as suggested in one of the issues for metagenomics datasets, just to see if I would get different results:

#!/bin/bash
#PBS -l select=2:ncpus=24:mpiprocs=24:mem=120gb
#PBS -N detecting_rRNA_DATA_3
#PBS -q normal
#PBS -P CBBI1470
#PBS -l walltime=8:00:00

eval "$(conda shell.bash hook)" 
conda activate ribodetector

##DATA LOCATION
data1_1=/mnt/lustre/users/maloo/euMwanza/dataset/S1_S1_L001_R1_001.fastq.gz
data1_2=/mnt/lustre/users/maloo/euMwanza/dataset/S1_S1_L001_R2_001.fastq.gz
data3_1=/mnt/lustre/users/maloo/euMwanza/dataset/D_1.fastq.gz
data3_2=/mnt/lustre/users/maloo/euMwanza/dataset/D_2.fastq.gz
data2_1=/mnt/lustre/users/maloo/euMwanza/dataset/S2_S1_L001_R1_001.fastq.gz
data2_2=/mnt/lustre/users/maloo/euMwanza/dataset/S2_S1_L001_R2_001.fastq.gz
out_folder=/mnt/lustre/users/maloo/euMwanza/riboDetector_analysis

ribodetector_cpu -t 48 \
  -l 120 \
  -i $data3_1 $data3_2 \
  -e norrna \
  --log $out_folder/D3.log \
  -r $out_folder/D3_reads.rRNA.R1.fq $out_folder/D3_reads.rRNA.R2.fq \
  -o $out_folder/D3_reads.nonrrna.R1.fq $out_folder/D3_reads.nonrrna.R2.fq

# ribodetector_cpu -t 48 \
#   -l 120 \
#   -i $data1_1 $data1_2 \
#   -e norrna \
#   --log $out_folder/D1.log \
#   -r $out_folder/D1_reads.rRNA.R1.fq $out_folder/D1_reads.rRNA.R2.fq \
#   -o $out_folder/D1_reads.nonrrna.R1.fq $out_folder/D1_reads.nonrrna.R2.fq

# ribodetector_cpu -t 48 \
#   -l 120 \
#   -i $data2_1 $data2_2 \
#   -e norrna \
#   --log $out_folder/D2.log \
#   -r $out_folder/D2_reads.rRNA.R1.fq $out_folder/D2_reads.rRNA.R2.fq \
#   -o $out_folder/D2_reads.nonrrna.R1.fq $out_folder/D2_reads.nonrrna.R2.fq

Here is the output of ls -lhtr of the output folder:

total 478M
-rw-r--r-- 1 maloo maloo    0 Mar 20 21:17 detecting_rRNA_DATA_3.o5463984
-rw-r--r-- 1 maloo maloo    0 Mar 20 21:17 detecting_rRNA_DATA_2.o5463981
-rw-r--r-- 1 maloo maloo    0 Mar 20 21:17 detecting_rRNA_DATA_1.o5463980
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:17 D1_reads.rRNA.R2.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:17 D1_reads.rRNA.R1.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:17 D1_reads.nonrrna.R2.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:17 D1_reads.nonrrna.R1.fq
-rw-rw-r-- 1 maloo maloo  686 Mar 20 21:17 D1.log
-rw-r--r-- 1 maloo maloo 2.6K Mar 20 21:17 detecting_rRNA_DATA_1.e5463980
-rw-r--r-- 1 maloo maloo  13K Mar 20 21:18 detecting_rRNA_DATA_2.e5463981
-rw-rw-r-- 1 maloo maloo 239M Mar 20 21:18 D2_reads.rRNA.R2.fq
-rw-rw-r-- 1 maloo maloo 239M Mar 20 21:18 D2_reads.rRNA.R1.fq
-rw-rw-r-- 1 maloo maloo 581K Mar 20 21:18 D2_reads.nonrrna.R2.fq
-rw-rw-r-- 1 maloo maloo 506K Mar 20 21:18 D2_reads.nonrrna.R1.fq
-rw-rw-r-- 1 maloo maloo  890 Mar 20 21:18 D2.log
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:21 D3_reads.rRNA.R2.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:21 D3_reads.rRNA.R1.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:21 D3_reads.nonrrna.R2.fq
-rw-rw-r-- 1 maloo maloo    0 Mar 20 21:21 D3_reads.nonrrna.R1.fq
-rw-rw-r-- 1 maloo maloo  688 Mar 20 21:21 D3.log
-rw-r--r-- 1 maloo maloo  794 Mar 20 21:21 detecting_rRNA_DATA_3.e5463984

It's also worth noting that the log files contain no errors for any of the datasets. Here is an example, the data3 log file:

2024-03-20 21:17:21 : INFO  Using high RECALL model
2024-03-20 21:17:21 : INFO  Log file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D3.log
2024-03-20 21:21:23 : INFO  33568314 sequences loaded!
2024-03-20 21:21:23 : INFO  Writing output rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D3_reads.rRNA.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D3_reads.rRNA.R2.fq
2024-03-20 21:21:23 : INFO  Writing output non-rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D3_reads.nonrrna.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D3_reads.nonrrna.R2.fq

data1 log file:

2024-03-20 21:17:21 : INFO  Using high RECALL model
2024-03-20 21:17:21 : INFO  Log file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D1.log
2024-03-20 21:17:29 : INFO  943825 sequences loaded!
2024-03-20 21:17:29 : INFO  Writing output rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D1_reads.rRNA.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D1_reads.rRNA.R2.fq
2024-03-20 21:17:29 : INFO  Writing output non-rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D1_reads.nonrrna.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D1_reads.nonrrna.R2.fq

data2 log file which produces classification results:

2024-03-20 21:17:32 : INFO  Using high RECALL model
2024-03-20 21:17:32 : INFO  Log file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D2.log
2024-03-20 21:17:37 : INFO  441124 sequences loaded!
2024-03-20 21:17:37 : INFO  Writing output rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D2_reads.rRNA.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D2_reads.rRNA.R2.fq
2024-03-20 21:17:37 : INFO  Writing output non-rRNA sequences into file: /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D2_reads.nonrrna.R1.fq, /mnt/lustre/users/maloo/euMwanza/riboDetector_analysis/D2_reads.nonrrna.R2.fq
2024-03-20 21:18:26 : INFO  Writing outputs...
2024-03-20 21:18:28 : INFO  Detected 1695 non-rRNA sequences.
2024-03-20 21:18:28 : INFO  Detected 439429 rRNA sequences.

data2 and data1 can be found here: https://www.dropbox.com/scl/fo/l4fugkpayd02fayupelxk/h?rlkey=tdj95u75sup42na34ir4adjbn&dl=0

I believe the solution for data1 will provide insight into why data3 is failing. I couldn't upload data3 as it's too big and I don't have the space in Dropbox.

Thanks in advance

dawnmy commented 6 months ago

Hello, upon reviewing the logs, it appears that the data1 and data3 runs did not complete. Could you please confirm whether these processes had finished when you last checked the output?
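
One way to check this from the shell, as a minimal sketch: the data2 log quoted above ends with a `Detected ... rRNA sequences.` line, while the data1 and data3 logs stop after the `Writing output ...` lines. The helper name `run_finished` is hypothetical, and matching on that log text is an assumption based only on the logs shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: treat a RiboDetector run as finished only if its log
# contains the final "Detected N rRNA sequences." line seen in the data2 log.
run_finished() {
  grep -Eq 'Detected [0-9]+ rRNA sequences' "$1"
}

# Report the status of each log file passed on the command line.
for log in "$@"; do
  if run_finished "$log"; then
    echo "$log: finished"
  else
    echo "$log: incomplete (no final summary line)"
  fi
done
```

Saved under a name of your choice (for example `check_logs.sh`), it could be called as `sh check_logs.sh D1.log D2.log D3.log`.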

EorgeKit commented 6 months ago

Yes, they had completed, or at least no job was running.

EorgeKit commented 6 months ago

I am now noticing that the data1 log has this at the end: PBS: job killed: mem 125971228kb exceeded limit 125829120kb
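
For context, the limit in the job-killed message above matches the mem=120gb request in the PBS header, assuming PBS counts 1 gb as 1024 * 1024 kb. A quick shell check of the arithmetic (the variable names are just for illustration):

```shell
#!/bin/sh
# Values copied from the job-killed message (in kb).
used_kb=125971228
limit_kb=125829120

# 120 gb expressed in kb, assuming 1 gb = 1024 * 1024 kb.
requested_kb=$((120 * 1024 * 1024))

# How far the job went over the limit, in kb and in whole MB (1 MB = 1024 kb here).
over_kb=$((used_kb - limit_kb))
over_mb=$((over_kb / 1024))

echo "requested: ${requested_kb} kb"
echo "over by:   ${over_kb} kb (about ${over_mb} MB)"
```

So the job was killed for exceeding its memory request by a small margin, not because of an error inside the tool itself.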

dawnmy commented 6 months ago

Can you then add --chunk_size 256 to the command line?
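
As a sketch, this is the data1 command from the script above with that option added. The paths are copied from the original post; the wrapper function `run_d1` and the PATH check are only illustrative additions. As I understand the tool's help text, `--chunk_size N` loads N * 1024 reads at a time instead of loading all reads at once, which should lower peak memory use.

```shell
#!/bin/bash
# Paths copied from the script earlier in this thread.
data1_1=/mnt/lustre/users/maloo/euMwanza/dataset/S1_S1_L001_R1_001.fastq.gz
data1_2=/mnt/lustre/users/maloo/euMwanza/dataset/S1_S1_L001_R2_001.fastq.gz
out_folder=/mnt/lustre/users/maloo/euMwanza/riboDetector_analysis

# Illustrative wrapper (hypothetical name): the original data1 command
# plus --chunk_size 256, so reads are processed in chunks.
run_d1() {
  ribodetector_cpu -t 48 \
    -l 120 \
    -i "$data1_1" "$data1_2" \
    -e norrna \
    --chunk_size 256 \
    --log "$out_folder/D1.log" \
    -r "$out_folder/D1_reads.rRNA.R1.fq" "$out_folder/D1_reads.rRNA.R2.fq" \
    -o "$out_folder/D1_reads.nonrrna.R1.fq" "$out_folder/D1_reads.nonrrna.R2.fq"
}

# Only run when the tool is on PATH (e.g. after `conda activate ribodetector`).
if command -v ribodetector_cpu >/dev/null 2>&1; then
  run_d1
else
  echo "ribodetector_cpu not found; activate the conda environment first" >&2
fi
```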

EorgeKit commented 6 months ago

Yes, I can, and it worked. Thanks a lot.

dawnmy commented 6 months ago

Great to hear it worked for you! If you prefer to run it without chunking the data, you'll need to increase the mem setting in the PBS script. Using a chunk_size shouldn't significantly slow down the process, though; it just means you won't see the nice progress bar.