smith-chem-wisc / Spritz

Software to create sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

Still running after five days of computation #210

Closed: Stanley80 closed this issue 3 years ago

Stanley80 commented 3 years ago

Dear Developers,

we installed the latest version of Spritz from your repository, and we are running it on 18 paired-end RNA-Seq files on a local cluster (64 GB RAM ...) running CentOS. We launched Spritz two weeks ago with the following command:

snakemake -j 24 --resources mem_mb=40000 --verbose --printshellcmds --use-conda --keep-going

We also increased the amount of memory available to GATK (--java-options "-Xmx32000M").
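For reference, -j 24 caps the cores Snakemake may use, while --resources mem_mb=40000 caps the summed memory claims of the jobs it schedules concurrently. A minimal sketch of that interplay, assuming a hypothetical per-rule request of 10000 mem_mb (not a value taken from the Spritz workflow):

```shell
# Snakemake runs jobs concurrently only while their summed mem_mb requests
# stay under the --resources mem_mb cap (and the core count stays under -j).
total_mem_mb=40000      # from --resources mem_mb=40000
per_job_mem_mb=10000    # hypothetical per-rule request, for illustration only
max_cores=24            # from -j 24
by_memory=$(( total_mem_mb / per_job_mem_mb ))
concurrent=$(( by_memory < max_cores ? by_memory : max_cores ))
echo "max concurrent jobs: ${concurrent}"
```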

For the last five days, Spritz has been running the following commands without any errors:

(gatk --java-options "-Xmx32000M -Dsamjdk.compression_level=9" AddOrReplaceReadGroups -PU platform -PL illumina -SM sample -LB library -I 2021-03-09/align/combined.sorted.bam -O 2021-03-09/variants/combined.sorted.grouped.bam -SO coordinate --TMP_DIR tmp && samtools index 2021-03-09/variants/combined.sorted.grouped.bam && gatk --java-options "-Xmx32000M -Dsamjdk.compression_level=9" MarkDuplicates -I 2021-03-09/variants/combined.sorted.grouped.bam -O 2021-03-09/variants/combined.sorted.grouped.marked.bam -M 2021-03-09/variants/combined.sorted.grouped.marked.metrics --TMP_DIR tmp -AS true && samtools index 2021-03-09/variants/combined.sorted.grouped.marked.bam) &> 2021-03-09/variants/combined.sorted.grouped.marked.log

Since we don't have enough experience to judge whether everything is going fine, could you tell us whether five days (24 h/day) of computation is reasonable for completing these commands?

Thank you very much for your amazing work and support

acesnik commented 3 years ago

Hi @Stanley80,

Thanks for the message. Five days is definitely longer than I'd expect. This step takes about 3 hours for a dataset with ~3e8 reads. How many reads does your dataset have?

I'm curious if GATK ran out of memory for you before you increased the memory allocation. I usually see this step use ~10 GB at maximum.

If you share the log and benchmark files, I may be able to get a better idea of whether this is normal behavior or whether something is off.

Best regards,

Anthony

Stanley80 commented 3 years ago

Thank you, Anthony, for the quick response.

Our dataset has 100 000 000 reads for R1 and 100 000 000 for R2 on average.
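If we scale your 3-hour figure linearly, we get the following back-of-envelope estimate. Two assumptions here are guesses, not figures confirmed in the thread: that runtime scales linearly with read count, and that the combined BAM merges all 18 pairs at ~2e8 reads each:

```shell
# Back-of-envelope runtime check under the assumptions stated above.
reads_per_pair=200000000          # 1e8 R1 + 1e8 R2
pairs=18
reference_reads=300000000         # ~3e8 reads took ~3 hours per the reply above
reference_hours=3
total_reads=$(( reads_per_pair * pairs ))
expected_hours=$(( reference_hours * total_reads / reference_reads ))
echo "combined reads: ${total_reads}"
echo "expected hours at the reported rate: ${expected_hours}"
```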

Please, find attached a zip file with the log and benchmark files you requested.

log and benchmark files.zip

Thank you again for your support.

acesnik commented 3 years ago

No problem!

Everything looks okay. It looks like it took a couple of days for the previous step (AddOrReplaceReadGroups), which is a lot, but it did finish. Looking at the options used for it, I now see that I should remove -SO coordinate from that step so that it doesn't sort again. I switched to sorting with samtools beforehand because I've found it to be faster, but it looks like that option stuck around for some reason. Thanks for helping identify that!
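For anyone following along, here is a sketch of what the AddOrReplaceReadGroups invocation would look like with -SO coordinate dropped. The paths and options are copied from the logged command above; this is an illustration of the change, not the exact text of the updated rule:

```shell
# Build the corrected command as a string so the change is easy to see.
# -SO coordinate is gone: the input BAM is already coordinate-sorted by
# samtools in the preceding align step, so re-sorting here is redundant.
cmd='gatk --java-options "-Xmx32000M -Dsamjdk.compression_level=9" AddOrReplaceReadGroups'
cmd="${cmd} -PU platform -PL illumina -SM sample -LB library"
cmd="${cmd} -I 2021-03-09/align/combined.sorted.bam"
cmd="${cmd} -O 2021-03-09/variants/combined.sorted.grouped.bam"
cmd="${cmd} --TMP_DIR tmp"
echo "${cmd}"
```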

There is some odd behavior in the MarkDuplicates step, where occasionally it takes a long time to match pairs from subsets of reads, e.g., the 240 s it took for one of the batches in the screenshot below. I don't know the reason for that, but it does look like it's moving along. It's going at about half the speed I would expect, unfortunately, but I would advise just letting it run.

[screenshot: MarkDuplicates log showing per-batch pair-matching times]

Based on the speed it is going, I would expect it to take another five days. But it does look like it's still churning along!

acesnik commented 3 years ago

I hope the rest of this run went okay for you. I'm going to close this issue for now. Please feel free to reopen it if you have any other issues.

Regarding my technical note above, I've removed -SO coordinate from the AddOrReplaceReadGroups rule in this PR, which will be the next one merged: https://github.com/smith-chem-wisc/Spritz/pull/211/files#diff-578e296792e2121effc08b67456c54db53abb933db14dc68691dd2b45fdd1132R49