sortmerna / sortmerna

SortMeRNA: next-generation sequence filtering and alignment tool
https://sortmerna.readthedocs.io
GNU General Public License v3.0

sortmerna 4.0 processing stuck #212

Closed: QianqianJena closed this issue 2 years ago

QianqianJena commented 4 years ago

Dear SortMeRNA users,

I have forward and reverse files with 100 million reads each, and I used SortMeRNA 4.0 to sort mRNA with the command below:

sortmerna \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-bac-16s-id90.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-bac-23s-id98.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-arc-16s-id95.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-arc-23s-id98.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-euk-18s-id95.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/silva-euk-28s-id98.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/rfam-5s-database-id98.fasta \
  -ref /home/te73huq/other/sortmerna-2.1/rRNA_databases/rfam-5.8s-database-id98.fasta \
  -reads /home/te73huq/other/qc/IS1_f_qc.fastq.gz \
  -reads /home/te73huq/other/qc/IS2_r_qc.fastq.gz \
  -workdir /home/te73huq/IS1_mrna_4.0 \
  --other /home/te73huq/IS1_mrna_4.0/IS1_mrna_non.fastq.gz \
  --paired_in \
  -v \
  -a 25

But it took 10 days to finish this sample. Did I do something wrong? Why did it take so much time? If you find something wrong, please let me know. I appreciate your help in advance. Thanks and I look forward to your reply.

Best regards,
Qianqian

biocodz commented 4 years ago

@QianqianJena Hello, do you see any output (execution trace) of the program like the one here: Sample-execution-trace? If yes, could you post it?

Please also provide the size (ls -l reads_file) and the line count (wc -l reads_file) of your reads file in the format name : size : line count. Also, what's the output of file reads_file?
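The requested report can be produced in one step. The following is a minimal sketch, not part of SortMeRNA: the function name describe_reads is hypothetical, and decompressing .gz input before counting lines is an assumption about what is wanted.

```shell
# Print "name : size in bytes : line count" for each reads file given.
# describe_reads is a hypothetical helper, not a SortMeRNA command.
describe_reads() {
  local f size lines
  for f in "$@"; do
    size=$(( $(wc -c < "$f") ))
    case "$f" in
      # Assumption: for gzipped input, count the lines of the decompressed stream.
      *.gz) lines=$(( $(gzip -cd "$f" | wc -l) )) ;;
      *)    lines=$(( $(wc -l < "$f") )) ;;
    esac
    printf '%s : %d : %d\n' "$(basename "$f")" "$size" "$lines"
  done
}
```

Called as describe_reads reads_file, it prints one line per file. A FASTQ record spans four lines, so the number of reads is the line count divided by four.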

QianqianJena commented 4 years ago

Dear SortMeRNA group,

Thanks for your reply.

Please have a look at the attached execution-trace file.

Here is the detailed report:

Input forward file:IS13_f.fastq:34936244849:400155328.

Input reverse file:IS13_f.fastq:34558730639:400155328.

Output mRNA file:IS13_other.fastq:4212633783:51077492.

I am sorry, but I have deleted the aligned fastq file.

If you need further information, please let me know.

Thank you very much.

Best regards,

Qianqian

sortmrna.txt

biocodz commented 4 years ago

One problem is immediately apparent:

[KeyValueDatabase:18] Path '/home/te73huq/sortmerna/run/kvdb' exists with the following content:
"OPTIONS-000005"
"LOG"
"LOG.old.1580404316710430"
"CURRENT"
"MANIFEST-000005"
"000006.log"
"IDENTITY"
"LOCK"
"OPTIONS-000008"

The kvdb directory has to be purged before each new run (see Work_directory).

Otherwise, not only is the run time ridiculous, but the alignment results are invalid too. This will change in future releases so that the kvdb is reused, but for now it has to be purged. For the time being, we should probably just interrupt the execution if the kvdb is not empty.
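A minimal sketch of the purge step, assuming the key-value store sits in a kvdb directory directly under the work directory, as in the trace quoted above. The function name purge_kvdb is hypothetical and not part of SortMeRNA.

```shell
# Remove a stale key-value store before starting a new run.
# Assumption: the store is WORKDIR/kvdb, as shown in the trace above.
purge_kvdb() {
  local workdir=$1
  # Refuse an empty argument so that "/kvdb" is never targeted by accident.
  [ -n "$workdir" ] || return 1
  local kvdb="$workdir/kvdb"
  if [ -d "$kvdb" ]; then
    rm -rf -- "$kvdb"
    echo "purged $kvdb"
  else
    echo "nothing to purge at $kvdb"
  fi
}
```

Calling purge_kvdb with the work directory right before each sortmerna invocation keeps results from an earlier run out of the new one.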

QianqianJena commented 4 years ago

Thanks for your reply and it makes sense now. Best, Qianqian

QianqianJena commented 4 years ago

Hello Biocodz, I have recently run sortmerna on other samples, and it takes 3-4 days per sample. The execution-trace file is attached. Everything looks OK, but slow. I am wondering whether it is possible to speed up the alignment step, since it takes so long. Thank you in advance. Best, Qianqian

sortmerna_4days.txt

biocodz commented 4 years ago

Some info is chopped off the trace, so I cannot see how many CPUs your machine has. Is there a particular reason you are using 8 threads? Can you increase the number, or maybe even run with the default (all cores)? I created the summary of the run below.

num reads: 100,415,918

ref                           hash                  size           sec        min     hr
silva-bac-16s-id90.fasta      15734375058464002811  19,437,013     19589.84   326.48  5.44
silva-bac-23s-id98.fasta      17299952793705614139  12,911,743      7313.21   121.88  2.03
silva-arc-16s-id95.fasta      3436099190853847617    3,893,959      3047.73    50.78  0.85
silva-arc-23s-id98.fasta      3400685301612210653      752,022       370.21     6.17  0.10
silva-euk-18s-id95.fasta      2700646386527218729   13,259,584     11259.34   187.66  3.13
silva-euk-28s-id98.fasta      1845323523482939374   14,945,070      4182.19    69.70  1.16
rfam-5s-database-id98.fasta   13019673092862722585   8,525,326      3263.54    54.39  0.90
rfam-5.8s-database-id98.fasta 2169995244134016533    2,280,449      3259.41    54.32  0.90
                                                    Total time (hr) for alignment:    14.5

Mostly, the processing time is consistent with the reference file size, although rows 5 and 6 (silva-euk-18s and silva-euk-28s) differ dramatically in time even though their sizes are about the same. There can be different reasons for this; one is that your machine is busy with other tasks, which I cannot verify, of course. Have you tried top during the processing to see how the machine's resources behave?
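The size-to-time comparison can be made explicit by dividing each alignment time by the reference size. The sketch below copies the size and seconds columns from the summary above; the function name rate_table and the seconds-per-megabyte measure are illustrative choices, not something SortMeRNA reports.

```shell
# Columns: reference file, size in bytes, alignment time in seconds
# (values copied from the run summary above).
summary='silva-bac-16s-id90.fasta 19437013 19589.84
silva-bac-23s-id98.fasta 12911743 7313.21
silva-arc-16s-id95.fasta 3893959 3047.73
silva-arc-23s-id98.fasta 752022 370.21
silva-euk-18s-id95.fasta 13259584 11259.34
silva-euk-28s-id98.fasta 14945070 4182.19
rfam-5s-database-id98.fasta 8525326 3263.54
rfam-5.8s-database-id98.fasta 2280449 3259.41'

# Print seconds of alignment per megabyte of reference, then the total in hours.
rate_table() {
  awk '{ printf "%-30s %8.1f s/MB\n", $1, $3 / ($2 / 1e6); total += $3 }
       END { printf "total %.1f hr\n", total / 3600 }'
}

printf '%s\n' "$summary" | rate_table
```

On these numbers silva-euk-18s comes out at roughly 850 s/MB against roughly 280 s/MB for silva-euk-28s, which is the discrepancy pointed out above, and the total matches the 14.5 hours in the summary.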

QianqianJena commented 4 years ago

Hello Biocodz, thanks for the information. My machine has 24 cores, and there is no particular reason why I used 8 threads. It is the alignment that takes so long, and I am wondering whether there are parameters I can apply to increase the processing speed. If the command I use is fine, I will just increase the number of threads to speed up the processing. Thanks for your timely feedback. Best regards, Qianqian

Young331 commented 4 years ago

Hi, I'm using sortmerna version 4.2.0 on an HPC platform. It also runs very slowly when I use "-a 72": I got nothing after 14 hours. To accelerate the process I tried the parameter "-a 720", but I don't know why it is still very slow. Even after 1 day, there is NO OUTPUT at all. It stops at "testing file. alligned_fwd.fq". Is that right?

Due to the time limit, I can only run one job within 7 days on the HPC, and I have 18 samples, so I want to know how much time one sample needs. Did I do something wrong? Why did it take so much time? Could you check my command line? Please let me know if you have any suggestions for my work.

Thank you! This is the command line I am using:

sortmerna \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-bac-16s-id90.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-arc-16s-id95.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-euk-18s-id95.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-bac-23s-id98.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-arc-23s-id98.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/silva-euk-28s-id98.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/rfam-5.8s-database-id98.fasta \
  --ref $PyPackages/sortmerna/data/rRNA_databases/rfam-5s-database-id98.fasta \
  --reads RNA-1117-00_1_5-120_p.fq \
  --reads RNA-1117-00_2_5-120_p.fq \
  --fastx FASTQ \
  --kvdb RNA-1117-00_5-120 \
  --aligned RNA-1117-00_5-120_alligned \
  --other \
  --paired_in \
  --num_alignments 1 \
  -a 720 \
  -v \
  --workdir /nesi/nobackup/uoa02698/new_all_ORF \
  --out2

kvdb Output file: RNA-1117-00_5-120/

-rw-r-----+ 1 ywu580  65M Mar 29 00:04 000024.sst
-rw-r-----+ 1 ywu580  35M Mar 29 00:04 000025.sst
-rw-r-----+ 1 ywu580 416K Mar 29 00:05 000026.log
-rw-r-----+ 1 ywu580  14M Mar 29 00:06 000027.sst
-rw-r-----+ 1 ywu580   16 Mar 28 23:48 CURRENT
-rw-r-----+ 1 ywu580   37 Mar 28 23:45 IDENTITY
-rw-r-----+ 1 ywu580    0 Mar 28 23:45 LOCK
-rw-rw----+ 1 ywu580 911K Mar 30 00:45 LOG
-rw-r-----+ 1 ywu580  886 Mar 29 00:06 MANIFEST-000008
-rw-r-----+ 1 ywu580 5.0K Mar 28 23:45 OPTIONS-000005

These files are all empty:

other_fwd.fq
other_rev.fq
RNA-1117-00_5-120_alligned_fwd.fq
RNA-1117-00_5-120_alligned_rev.fq
RNA-1117-00_5-120_alligned.log

It stops here for a long time:

[Index:116] Found 32 non-empty index files. Skipping indexing.
[Index:117] TODO: a better validation using an index descriptor to decide on indexing
[calculate:107] Starting statistics calculation on file: 'RNA-1117-00_1_5-120_p.fq'  ...
[calculate:225] Done statistics on file. Elapsed time: 27.13 sec. all_reads_count= 55180977
[calculate:107] Starting statistics calculation on file: 'RNA-1117-00_2_5-120_p.fq'  ...
[calculate:225] Done statistics on file. Elapsed time: 27.09 sec. all_reads_count= 110361954
[store_to_db:421] Stored Reads statistics to DB:
    min_read_len= 120 max_read_len= 120 all_reads_count= 110361954 all_reads_len= 13243434480 total_reads_mapped= 0 total_reads_mapped_cov= 0 reads_matched_per_db= TODO is_total_reads_mapped_cov= 0 is_stats_calc= 0

[init:101] Testing file: "/scale_wlg_nobackup/filesets/nobackup/new_all_ORF/RNA-1117-00_5-120_alligned_fwd.fq"
[init:101] Testing file: "/scale_wlg_nobackup/filesets/nobackup/new_all_ORF/RNA-1117-00_5-120_alligned_rev.fq"
[init:131] Testing file: "/scale_wlg_nobackup/filesets/nobackup/new_all_ORF/other_fwd.fq"
[init:131] Testing file: "/scale_wlg_nobackup/filesets/nobackup/new_all_ORF/other_rev.fq"
[init:218] Testing file: "/scale_wlg_nobackup/filesets/nobackup/new_all_ORF/RNA-1117-00_5-120_alligned.log"

biocodz commented 4 years ago

@Young331 Your process is stuck at the initialization phase and hasn't even started the alignment.

If you don't see a line ==== Starting alignment ==== in the output trace, and the trace is not progressing, and doesn't show the number of reads processed - don't wait. Kill it and log an issue.
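The check for the marker line can be automated when the trace is captured to a file. This is a minimal sketch: the function name alignment_started is hypothetical, and the marker text is taken from the trace lines quoted in this thread.

```shell
# Succeed (exit status 0) if the captured trace contains the alignment marker.
# alignment_started is a hypothetical helper; the marker string is the one
# quoted in this thread.
alignment_started() {
  grep -qF -- '==== Starting alignment ====' "$1"
}
```

For example, alignment_started trace.txt || echo "still in initialization" reports a run that has not reached the alignment phase, where trace.txt stands for whatever file holds the redirected output.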

Sortmerna stores all the calculations in a database, so the output files will be empty until the processing has finished.

Can you send me the complete execution trace like the one here? You can use biocodz at protonmail dot com

One problem in your command line: --fastx FASTQ. fastx is a boolean option, i.e. it doesn't take an argument.

I tested your command (modified to use my local directory structure):

sortmerna \
  --ref sortmerna/data/rRNA_databases/silva-bac-16s-id90.fasta \
  --ref sortmerna/data/rRNA_databases/silva-arc-16s-id95.fasta \
  --ref sortmerna/data/rRNA_databases/silva-euk-18s-id95.fasta \
  --ref sortmerna/data/rRNA_databases/silva-bac-23s-id98.fasta \
  --ref sortmerna/data/rRNA_databases/silva-arc-23s-id98.fasta \
  --ref sortmerna/data/rRNA_databases/silva-euk-28s-id98.fasta \
  --ref sortmerna/data/rRNA_databases/rfam-5.8s-database-id98.fasta \
  --ref sortmerna/data/rRNA_databases/rfam-5s-database-id98.fasta \
  --reads sortmerna/data/set4_mate_pairs_metatranscriptomics_1.fastq.gz \
  --reads sortmerna/data/set4_mate_pairs_metatranscriptomics_2.fastq.gz \
  --fastx \
  --kvdb sortmerna/run/RNA-1117-00_5-120 \
  --aligned RNA-1117-00_5-120_alligned \
  --other \
  --paired_in \
  --num_alignments 1 \
  -v \
  --workdir sortmerna/run \
  --out2

...
[init:101] Testing file: "/home/biocodz/RNA-1117-00_5-120_alligned_fwd.fastq"
[init:101] Testing file: "/home/biocodz/RNA-1117-00_5-120_alligned_rev.fastq"
[init:131] Testing file: "/home/biocodz/other_fwd.fastq"
[init:131] Testing file: "/home/biocodz/other_rev.fastq"
[init:218] Testing file: "/home/biocodz/RNA-1117-00_5-120_alligned.log"

[align:359] ==== Starting alignment ====

[align:369] Using default number of Processor threads equals num CPU cores: 8
Number of cores: 8 Read threads:  1 Write threads: 1 Processor threads: 8
[ThreadPool:36] initialized Pool with: [10] threads
...
[align:462] Done index 0 Part: 1 Time: 1.24 sec
...
[align:462] Done index 1 Part: 1 Time: 0.67 sec
...
[align:462] Done index 7 Part: 1 Time: 0.76 sec

[align:469] ==== Done alignment ====

...
[generateReports:1120] === Done Reports generation ===

[~ReadsQueue:68] Destructor called on write_queue  recs.size= 0 pushed: 0  popped: 0
[~ReadsQueue:68] Destructor called on read_queue  recs.size= 0 pushed: 80000  popped: 80000
Thread  140051409258240 job done
Thread  140051417650944 job done
[closefiles:885] Flushed and closed

Young331 commented 4 years ago

slurm-10690575.txt

I had run the same sample before, so I had not deleted the idx folder this time. Yes, it didn't start. Maybe I need to delete the idx folder and try again. I have started it again and will update on the progress.

Same problem: it stops at "testing file......". Please check the second trace file: slurm-10701575.txt

biocodz commented 4 years ago

You don't need to delete idx. I think the problem is fastx, as mentioned in my previous message. P.S. No, this is actually not a problem: the extraneous argument is simply ignored.

You can try the command I used in my previous message to confirm it works. Then compare it to yours.

It appears there is a problem creating the directories/files in your case, although it's not clear why no error is thrown.

johanneswerner commented 4 years ago

I have exactly the same problem. After indexing, one CPU is at 100% (even though I have 28 available) with 0% memory usage, but the program does not show any output.

Here is my command:

sortmerna \
  --ref ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta \
  --ref ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta \
  --reads ../../rRNA_Method/R1.fastq.gz \
  --reads ../../rRNA_Method/R2.fastq.gz \
  --workdir workdir \
  --fastx \
  --aligned \
  --other \
  --best 1 \
  --paired_in \
  --threads 28 \
  -v

Can you see any problems?

Thank you!

EDIT: the problem does not occur with v. 2.1b (I used that version as this is the next lower version available in bioconda).

biocodz commented 4 years ago

The program is not running. A single CPU at 100% load most likely means the main thread of the process is stuck in a loop. I would need to see the complete execution trace, i.e. whatever is printed on the screen.

johanneswerner commented 4 years ago

@biocodz Thank you very much for your help.

Input files, compressed with gzip:

4.5G R26_S25_L005_R1_001.fastq.gz (268058796 lines)
4.6G R26_S25_L005_R2_001.fastq.gz (268058796 lines)

terminal output:

sortmerna \
  --ref ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta \
  --ref ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta \
  --reads ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz \
  --reads ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz \
  --workdir workdir \
  --fastx \
  --aligned \
  --other \
  --best 1 \
  --paired_in \
  --threads 28 \
  -v

[process:1369] === Options processing starts ... ===

Found value: sortmerna
Found flag: --ref
Found value: ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta of previous flag: --ref
Found flag: --ref
Found value: ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta of previous flag: --ref
Found flag: --reads
Found value: ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz of previous flag: --reads
Found flag: --reads
Found value: ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz of previous flag: --reads
Found flag: --workdir
Found value: workdir of previous flag: --workdir
Found flag: --fastx
Previous flag: --fastx is Boolean. Setting to True
Found flag: --aligned
Previous flag: --aligned is Boolean. Setting to True
Found flag: --other
Previous flag: --other is Boolean. Setting to True
Found flag: --best
Found value: 1 of previous flag: --best
Found flag: --paired_in
Previous flag: --paired_in is Boolean. Setting to True
Found flag: --threads
Found value: 28 of previous flag: --threads
Found flag: -v
[opt_workdir:1066] Using WORKDIR: ["/data/folder/repo/sortmerna_test/workdir"] as specified
[process:1453] Processing option: aligned with value: 
[opt_aligned:256] Directory and Prefix for the aligned output was not provided. Using default dir/pfx: 'WORKDIR/out/aligned'
[process:1453] Processing option: best with value: 1
[process:1453] Processing option: fastx with value: 
[process:1453] Processing option: other with value: 
[opt_other:285] other was specified without argument. Will use default Directory and Prefix for the non-aligned output.
[process:1453] Processing option: paired_in with value: 
[process:1453] Processing option: reads with value: ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz
[opt_reads:73] Processing reads file [1] out of total [2] files
[process:1453] Processing option: reads with value: ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz
[opt_reads:73] Processing reads file [2] out of total [2] files
[process:1453] Processing option: ref with value: ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta
[opt_ref:166] Processing reference [1] out of total [2] references
[opt_ref:220] File ["/data/folder/repo/sortmerna_test/../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta"] exists and is readable
[process:1453] Processing option: ref with value: ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta
[opt_ref:166] Processing reference [2] out of total [2] references
[opt_ref:220] File ["/data/folder/repo/sortmerna_test/../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta"] exists and is readable
[process:1453] Processing option: threads with value: 28
[process:1453] Processing option: v with value: 

[process:1473] === Options processing done ===

[validate_kvdbdir:1252] Key-value DB location "/data/folder/repo/sortmerna_test/workdir/kvdb"
[validate_kvdbdir:1288] Creating KVDB directory: "/data/folder/repo/sortmerna_test/workdir/kvdb"
[validate_aligned_pfx:1307] Checking output directory: "/data/folder/repo/sortmerna_test/workdir/out"

WARNING: [validate:1557] 'best' [INT] has been set but no output format has been chosen (--blast | --sam | --otu_map). Using default 'blast'

  Program:      SortMeRNA version 4.2.0
  Copyright:    2016-2020 Clarity Genomics BVBA:
                Turnhoutseweg 30, 2340 Beerse, Belgium
                2014-2016 Knight Lab:
                Department of Pediatrics, UCSD, La Jolla
                2012-2014 Bonsai Bioinformatics Research Group:
                LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
  Disclaimer:   SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
                implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
                See the GNU Lesser General Public License for more details.
  Contributors: Jenya Kopylova   jenya.kopylov@gmail.com
                Laurent Noé      laurent.noe@lifl.fr
                Pierre Pericard  pierre.pericard@lifl.fr
                Daniel McDonald  wasade@gmail.com
                Mikaël Salson    mikael.salson@lifl.fr
                Hélène Touzet    helene.touzet@lifl.fr
                Rob Knight       robknight@ucsd.edu

[main:63] Running command:
sortmerna --ref ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta --ref ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta --reads ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz --reads ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz --workdir workdir --fastx --aligned --other --best 1 --paired_in --threads 28 -v

  Parameters summary: 
    K-mer size: 19
    K-mer interval: 1
    Maximum positions to store per unique K-mer: 10000

  Total number of databases to index: 2

[build_index:1189] Begin indexing file ../../rRNA_Method/rRNA_databases/silva-arc-16s-id95.fasta of size: 3893959 under index name workdir/idx/3436099190853847617
  Collecting nucleotide distribution statistics ..  done  [0.029558 sec]

  start index part # 0: 
    (1/3) building burst tries .. done  [1.788349 sec]
    (2/3) building CMPH hash .. done  [3.443561 sec]
    (3/3) building position lookup tables .. done [4.097246 sec]
    total number of sequences in this part = 3193
      temporary file was here: workdirsortmerna_keys_26593.txt
      writing kmer data to workdir/idx/3436099190853847617.kmer_0.dat

      writing burst tries to workdir/idx/3436099190853847617.bursttrie_0.dat
      writing position lookup table to workdir/idx/3436099190853847617.pos_0.dat
      writing nucleotide distribution statistics to workdir/idx/3436099190853847617.stats
    done.

[build_index:1189] Begin indexing file ../../rRNA_Method/rRNA_databases/silva-bac-16s-id90.fasta of size: 19437013 under index name workdir/idx/15734375058464002811
  Collecting nucleotide distribution statistics ..  done  [0.161797 sec]

  start index part # 0: 
    (1/3) building burst tries .. done  [15.391355 sec]
    (2/3) building CMPH hash .. done  [15.824412 sec]
    (3/3) building position lookup tables .. done [69.375532 sec]
    total number of sequences in this part = 12798
      temporary file was here: workdirsortmerna_keys_26593.txt
      writing kmer data to workdir/idx/15734375058464002811.kmer_0.dat

      writing burst tries to workdir/idx/15734375058464002811.bursttrie_0.dat
      writing position lookup table to workdir/idx/15734375058464002811.pos_0.dat
      writing nucleotide distribution statistics to workdir/idx/15734375058464002811.stats
    done.

and since then not much has happened.

If there is any additional information I can provide for you, please let me know.

biocodz commented 4 years ago

The program has a problem reading your .gz files. What's the output of

gzip --version
gzip -l yourfile.gz

A similar problem was recently solved in issue 221. How were your files created?

johanneswerner commented 4 years ago

Hm, something is indeed very fishy:

$ gzip --version
gzip 1.6
Copyright (C) 2007, 2010, 2011 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Jean-loup Gailly.

$ gzip -l ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz
         compressed        uncompressed  ratio uncompressed_name
         4767463884               21442 -22234131.2% ../../rRNA_Method/R26_S25_L005_R1_001.fastq

$ gzip -l ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz
         compressed        uncompressed  ratio uncompressed_name
         4866268810               22002 -22117292.9% ../../rRNA_Method/R26_S25_L005_R2_001.fastq

EDIT: I did not generate the sequence data by myself, hence I don't know how they were generated.
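A note on why the listing looks like this. The gzip format records the uncompressed size in a 4-byte trailer, modulo 2^32, and for a file made of several concatenated members the --list option reports the last member only, so gzip -l cannot show the true size of inputs this large. Whether these particular files are multi-member is an assumption; the sketch below only contrasts the trailer value with the size obtained by decompressing. The function names are hypothetical, and reading the trailer with od assumes a little-endian host.

```shell
# Size recorded in the last 4 bytes of the file (the gzip trailer field),
# which is the value that `gzip -l` relies on.
# Assumption: little-endian host, so od reads the field in the right byte order.
trailer_size() {
  tail -c 4 "$1" | od -An -tu4 | tr -d ' \n'
}

# Size obtained by decompressing every member of the file.
actual_size() {
  gzip -cd "$1" | wc -c | tr -d ' '
}
```

If the two values differ, the listing is misleading and the line count of the decompressed stream is the figure to trust.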

biocodz commented 4 years ago

what's the output of

gzip -l --verbose yourfile.gz

johanneswerner commented 4 years ago

$ gzip -l --verbose ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz
gzip: ../../rRNA_Method/R26_S25_L005_R1_001.fastq.gz: extra field of 6 bytes ignored
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 2da034f2 Jun 24 16:00          4767463884               21442 -22234131.2% ../../rRNA_Method/R26_S25_L005_R1_001.fastq

$ gzip -l --verbose ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz
gzip: ../../rRNA_Method/R26_S25_L005_R2_001.fastq.gz: extra field of 6 bytes ignored
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 8eb0f23b Jun 24 16:02          4866268810               22002 -22117292.9% ../../rRNA_Method/R26_S25_L005_R2_001.fastq

thank you for your help btw

biocodz commented 4 years ago

Could you try to recompress the files as per man gzip:

gzip -cd old.gz | gzip > new.gz
or
gzip --keep -cd old.gz | gzip > new.gz  # --keep retains the input file (with -c it is left in place anyway)
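Before the original is discarded, the recompressed file can be checked against it. The following is a minimal sketch under stated assumptions: the function name recompress_checked is hypothetical, and it decompresses the input several times, which is slow for files of this size but keeps the steps easy to follow.

```shell
# Recompress OLD into NEW as a single gzip stream and verify the result.
# recompress_checked is a hypothetical helper built on the command above.
recompress_checked() {
  local old=$1 new=$2
  gzip -t "$old" || return 1                    # input must decompress cleanly
  gzip -cd "$old" | gzip > "$new" || return 1   # same pipeline as above
  gzip -t "$new" || return 1                    # output must be a valid gzip file
  # Compare checksum and byte count of both decompressed streams.
  [ "$(gzip -cd "$old" | cksum)" = "$(gzip -cd "$new" | cksum)" ]
}
```

A zero exit status from recompress_checked old.gz new.gz means both files decompress to identical content, so new.gz can be passed to --reads in place of old.gz.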

johanneswerner commented 4 years ago

seems to run smoothly now, thank you very much.

I admit, I would not have anticipated that the compression was the problem. Thank you very much and happy easter. :-)

biocodz commented 4 years ago

Glad to hear. Happy Easter!

We need to add better handling of such cases. I'll look into it.

natar210 commented 4 years ago

Dear @biocodz

I am writing with regard to the running time of SortMeRNA. I have 300 human RNA-seq samples (~ 39.7 million read pairs). Each file is roughly 1 GB (so 2 GB for the forward and reverse reads together). It's taking nearly a day to run one sample. Is that normal?

Here is my code:

$HOME/sortmerna/bin/sortmerna \
  --ref $HOME/sortmerna/data/rRNA_databases/rfam-5.8s-database-id98.fasta \
  --ref $HOME/sortmerna/data/rRNA_databases/rfam-5s-database-id98.fasta \
  --ref $HOME/sortmerna/data/rRNA_databases/silva-euk-18s-id95.fasta \
  --ref $HOME/sortmerna/data/rRNA_databases/silva-euk-28s-id98.fasta \
  --reads $file/$name"_R1.fq.gz" \
  --reads $file/$name"_R2.fq.gz" \
  --fastx \
  --kvdb workdir1/$name \
  --aligned $name"_rRNA" \
  --other $name"_other" \
  --paired_in \
  --num_alignments 1 \
  -v \
  --workdir workdir1 \
  -out2

Also attaching the log file here slurm-176630.txt

biocodz commented 4 years ago

Please refer to Issue 231. Unfortunately, the slow running is a bug caused by inefficient use of threads in 4.2.0 (race condition). When the number of worker threads exceeds a certain level (dependent on the hardware), the threads get stuck in context switching while waiting for a read to process. Using fewer threads (e.g. -threads 30) may improve the runtime. We are currently testing a number of solutions, including lockless queues and spin locks. This is our highest priority.
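One simple way to avoid grossly oversubscribed runs is to cap the requested thread count at the number of online cores. This is only a heuristic sketch: the level at which 4.2.0 slows down is hardware dependent and is not stated here, so the cap is an assumption, and the function name pick_threads is hypothetical.

```shell
# Print the requested thread count, capped at the number of online CPU cores.
# pick_threads is a hypothetical helper; the cap is a heuristic, not a
# documented SortMeRNA limit.
pick_threads() {
  local requested=$1 cores
  cores=$(getconf _NPROCESSORS_ONLN)
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
}
```

The result can be passed on the command line, for example --threads "$(pick_threads 30)", and lowered further if the run still stalls.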

iquasere commented 3 years ago

SortMeRNA is still at version 4.2.0. When is a fast version (at least without the data corruption of 2.1b and the same speed) expected to come out? I love the tool but it has these persisting problems that damage its use...

biocodz commented 3 years ago

Version 4.3.1 is currently being tested and shows very good results so far. The new version also introduces many nice features, such as generating gzipped output and running indexing independently of the alignment, among others. We skipped release 4.2.1 because the test results were not good enough, after which we completely redesigned the task parallelization pipeline. Hopefully it will be out of the door in a week or two. It will then take another week or so to provide the Conda installation because of the approval procedure on the part of the Biocore team.

biocodz commented 3 years ago

All tests are done now. Preparing the release documentation and the Conda recipe. A few more days...