novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

long-read data causing memory/batch issue #59

Closed Celinet21 closed 4 years ago

Celinet21 commented 4 years ago

I'm using long-read Oxford nanopore dRNA data. Epinano 1.2 svm, running

I ran into a Python memory issue when using the default num_reads_per_chunk (even with the latest update, which sets it to 1000). I changed it to 100 and no longer see the error. The error was this:

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/", line 258, in _bootstrap
  File "/usr/lib64/python3.6/multiprocessing/", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "EpiNano-Epinano1.2.0/", line 203, in split_tsv_for_per_site_var_freq
    q.put ((idx, chunk_out)) #.close()
  File "", line 2, in put
  File "/usr/lib64/python3.6/multiprocessing/", line 756, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib64/python3.6/multiprocessing/", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib64/python3.6/multiprocessing/", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

However, it's very slow. I just wanted to double-check whether what I did was a good or bad idea, and whether the slowness is unavoidable. Or should I have done something else?

ps = Process (target = split_tsv_for_per_site_var_freq, args = (tsv_gen, q, number_threads, 100))
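As a rough illustration of the trade-off (this is not EpiNano's actual implementation; `iter_chunks` is a hypothetical helper and the records are dummy data), a smaller chunk size makes each object placed on the queue smaller, but there are more of them to send:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size items from lines."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Dummy records standing in for per-read lines (hypothetical data).
records = [f"read{i}" for i in range(2500)]
n_large = sum(1 for _ in iter_chunks(records, 1000))  # fewer, larger chunks
n_small = sum(1 for _ in iter_chunks(records, 100))   # more, smaller chunks
```

With a chunk size of 100 instead of 1000, ten times as many chunks have to be serialized and sent, which is one plausible reason for the slowdown.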


Huanle commented 4 years ago

Hi Celinet21,

I ran into this error only once.
It is related to Python's pickling-based communication protocol between processes. Python 3.8 seems to have solved this issue. Have a look at this. Can you give it a go with Python 3.8 and let me know if this error still pops up?

python3.8 -m venv epinano12_py38  # python3.8.5
source epinano12_py38/bin/activate
pip install pandas
pip install  dask==2.5.2
python -m pip install "dask[dataframe]"
pip install pysam

I tried the above and succeeded.
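For context on the traceback, the failing call is `struct.pack("!i", n)`, which packs the message length into a signed 32-bit integer. A minimal sketch of why a pickled chunk larger than about 2 GiB triggers the error (the helper name `fits_in_int32_header` is made up for illustration):

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error message


def fits_in_int32_header(payload_size: int) -> bool:
    """Return True if payload_size can be packed as a signed 32-bit length."""
    try:
        struct.pack("!i", payload_size)
        return True
    except struct.error:
        return False


ok_at_limit = fits_in_int32_header(INT32_MAX)
ok_over_limit = fits_in_int32_header(INT32_MAX + 1)
```

Any single object sent between processes whose pickled size exceeds this limit fails on Python 3.6 in this way, which is consistent with the error disappearing when chunks are made smaller.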

In the meantime, you can split your bam file by reference ID and run the commands in parallel.

samtools view -hb your.bam refid1 > refid1.bam
samtools view -hb your.bam refid2 > refid2.bam

This way, with parallel commands and multiple processes, it should speed up the analysis quite a lot.
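The split step can also be scripted. Below is a minimal sketch, assuming `samtools` is on the PATH and the input bam is already indexed (region queries such as `your.bam refid1` need an index). The function names and the worker count are illustrative and not part of EpiNano:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_split_commands(bam_path: str, ref_id: str) -> List[List[str]]:
    """Build the samtools commands that extract and index one reference."""
    out_bam = f"{ref_id}.bam"
    return [
        # Same as: samtools view -hb your.bam refid > refid.bam
        ["samtools", "view", "-hb", "-o", out_bam, bam_path, ref_id],
        # Index the per-reference bam so downstream tools can use it.
        ["samtools", "index", out_bam],
    ]


def split_one(bam_path: str, ref_id: str) -> str:
    """Run the commands for one reference and return the output file name."""
    for cmd in build_split_commands(bam_path, ref_id):
        subprocess.run(cmd, check=True)
    return f"{ref_id}.bam"


def split_bam(bam_path: str, ref_ids: List[str], workers: int = 4) -> List[str]:
    """Split bam_path by reference ID, running several samtools jobs at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda ref: split_one(bam_path, ref), ref_ids))
```

For example, `split_bam("your.bam", ["refid1", "refid2"])` would produce `refid1.bam` and `refid2.bam` together with their index files.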

Celinet21 commented 4 years ago

I was previously using a python 3 version <3.8, and using >=3.8 didn't show that error at all, thank you!

And thank you for that suggestion for speeding things up - splitting the bam file. So the idea is to do that, then index each file after splitting... run on each split bam file (parallel)... then at the end combine all result xmer.csv into one.

Or I suppose I should continue with before combining. Then finally, I can combine at the end with no issues?

I am also wondering if you have suggestions on how much memory to allocate when running EpiNano? I also see the example commands give 6 threads as an argument.


Celinet21 commented 4 years ago

FYI, a couple of my tmp files created by Epinano_Variants (e.g ... .per_site_var.5mer.tmp) have a line like: ACAGN,17488366-17488370,+2,Null

Causing an index out of bound error at:

File "", line 869, in slide_per_site_var
    window = (ary[0], ary[1], ary[3], ary[6])       
IndexError: list index out of range
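
One possible way to guard against such short lines is sketched below. It assumes the tmp file is comma-separated (as the example line suggests) and that skipping incomplete lines is acceptable, which has not been confirmed here; `parse_window` is a hypothetical helper, not a function in EpiNano:

```python
from typing import Optional, Tuple

MIN_FIELDS = 7  # the failing statement reads ary[6], so 7 fields are needed


def parse_window(line: str, sep: str = ",") -> Optional[Tuple[str, str, str, str]]:
    """Return the four fields used for the window, or None if the line is too short."""
    ary = line.rstrip("\n").split(sep)
    if len(ary) < MIN_FIELDS:
        return None  # incomplete line such as "ACAGN,17488366-17488370,+2,Null"
    return (ary[0], ary[1], ary[3], ary[6])


bad = parse_window("ACAGN,17488366-17488370,+2,Null")
```

A caller could then skip lines for which `parse_window` returns `None` instead of raising an IndexError.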

Huanle commented 4 years ago

You can do the combination either before or after making predictions. Both are fine. As for memory usage, the most RAM-consuming step should be combining all the small chunks in the tmp folder. This is done with dask, which handles large files well through lazy loading. But honestly, I have not tested it extensively enough to determine the relationship between input size and memory requirements.

That said, I have attached the memory profiles for processing reads mapped to 1 chromosome and to all 16 chromosomes. You can see the difference is obvious.
The number of threads used was 8!

[Figure: Supp-Fig1-a, resource consumption when processing one chromosome]


Hope this helps.
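
The combination step mentioned above can be sketched with the standard library. This assumes each per-reference result file is a CSV with the same header row; the file names in the usage note are placeholders, and `combine_csvs` is an illustrative helper rather than part of EpiNano:

```python
import csv
from typing import Iterable


def combine_csvs(paths: Iterable[str], out_path: str) -> int:
    """Concatenate CSV files sharing one header; return the data rows written."""
    rows_written = 0
    header = None
    with open(out_path, "w", newline="") as out_fh:
        writer = csv.writer(out_fh)
        for path in paths:
            with open(path, newline="") as in_fh:
                reader = csv.reader(in_fh)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `combine_csvs(["refid1.5mer.csv", "refid2.5mer.csv"], "combined.5mer.csv")` reads one file at a time, so memory use stays small regardless of the total size.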

Celinet21 commented 4 years ago

Thanks @Huanle a lot!! Very helpful, and everything seems to be working and producing results thanks to your help.

I might open another issue for the other error I'm getting, though. I'm not sure whether it's a real problem or whether I can just ignore it.