novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

Memory requirements of TSV_to_Variants_Freq.py3 #43

Closed · tleonardi closed this issue 4 years ago

tleonardi commented 4 years ago

Hi, I've been trying to run EpiNano (v1.1) following the procedure outlined in the wiki. I'm running TSV_to_Variants_Freq.py3 as a batch job on LSF, but I've noticed that it uses a lot of memory: in my last attempt it used over 80GB before being killed by the scheduler. The input file is 48GB and I'm running the script with -t 2. Is this much RAM expected? Do you have guidelines on how much is needed, so that I can reserve it with the scheduler? Thanks!
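For reference, a submission along these lines might look like the sketch below. The memory value, its unit, and the file names are placeholders, and the EpiNano options other than -t are only hinted at (see the wiki for the real ones); LSF interprets -M and rusage[mem=...] in MB or KB depending on how LSB_UNIT_FOR_LIMITS is set at your site.

```bash
# Hypothetical bsub call reserving ~100 GB (assuming MB units) for a
# two-thread run of the script; adjust to your cluster's conventions.
bsub -n 2 -M 102400 -R "rusage[mem=102400]" \
    "python3 TSV_to_Variants_Freq.py3 -t 2 <other options from the wiki>"
```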

Huanle commented 4 years ago

Hi @tleonardi , In theory, the maximum memory should be equal to the total size of the intermediate *.freq files in the temporary folder, which can be kept if you switch on the -k option. Can you check whether that makes sense in your case? Thanks a lot.

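A quick way to do that check, as a sketch: assuming the kept temporary folder is the tmp_splitted directory mentioned in the next comment and that the intermediates end in .freq, sum their sizes and compare the total against the memory the job actually used.

```bash
# Total size of the intermediate .freq files (last line is the sum).
du -ch tmp_splitted/*.freq | tail -n 1
```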

tleonardi commented 4 years ago

Hi @Huanle, thanks for that. The tmp_splitted folder contains 537 .freq files and 537 .tsv files. The total size of the files is 50GB, which is slightly more than the size of the tsv from sam2tsv (48GB). However, the script uses far more memory than that. I've also noticed that the output folder contains the files .per_rd_var.5mer.csv and .per_rd_var.csv, but they are both empty. It seems this is the step where the process gets killed for using too much memory... do you have any hints?

Huanle commented 4 years ago

Hi @tleonardi , thanks for bringing this up. I will investigate it. In the meantime, you could try splitting your input tsv file by reference and running the script on each chunk independently. You can also choose not to generate the per-read data by leaving --per_read_stats switched off.
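A minimal sketch of that per-reference split, assuming the reference name sits in column 4 of the sam2tsv output and that input.tsv is the file fed to the script; verify the column layout on your own data before using it.

```bash
# Write each record to a per-reference chunk named split_<ref>.tsv.
# close() keeps the number of open file handles low (helps with many
# references), at some cost in speed.
awk -F'\t' '
  /^#/ { next }                                       # skip a header line, if any
  { out = "split_" $4 ".tsv"; print >> out; close(out) }
' input.tsv
```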

enovoa commented 4 years ago

Hi @tleonardi, someone else has also reported a similar issue recently. Did the solution that @Huanle suggested work for you? Thank you

tleonardi commented 4 years ago

Hi @enovoa, I tried running TSV_to_Variants_Freq.py3 without --per_read_stats, but it didn't help. What did the trick was filtering the BAM file to retain only highly expressed transcripts (coverage >500).
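One way to do that kind of filtering, sketched with standard samtools commands rather than anything EpiNano-specific: approximate per-transcript coverage by the mapped-read count from samtools idxstats and keep only references above the 500 cutoff mentioned above. File names are placeholders and the BAM must be indexed.

```bash
# Build a BED covering every reference with >500 mapped reads, then
# keep only reads overlapping those references.
samtools idxstats input.bam \
    | awk '$3 > 500 { print $1 "\t0\t" $2 }' > high_cov.bed
samtools view -b -L high_cov.bed input.bam > filtered.bam
samtools index filtered.bam
```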

enovoa commented 4 years ago

thanks @tleonardi for the clarification and the info :)