novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)
GNU General Public License v2.0

long-read data causing memory/batch issue #59

Closed · Celinet21 closed this issue 4 years ago

Celinet21 commented 4 years ago

I'm using long-read Oxford Nanopore dRNA data with the EpiNano 1.2 SVM, running Epinano_Variants.py.

I ran into a Python memory issue when using the default num_reads_per_chunk (even with the latest update, which sets it to 1000). After changing it to 100, the error no longer appears. The error was this:

Process Process-2:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "EpiNano-Epinano1.2.0/epinano_modules.py", line 203, in split_tsv_for_per_site_var_freq
    q.put ((idx, chunk_out)) #.close()
  File "<string>", line 2, in put
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 756, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib64/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

However, it's very slow. I just wanted to double-check whether what I did was a good or bad idea, and whether the slowness is unavoidable. Or should I have done something else?

ps = Process(target=split_tsv_for_per_site_var_freq, args=(tsv_gen, q, number_threads, 100))

Thanks!!

Huanle commented 4 years ago

Hi Celinet21,

I ran into this error only once.
It is related to Python's pickling-based communication protocol between processes. Python 3.8 seems to have solved this issue; have a look at this. Can you give it a go with Python 3.8 and let me know if the error still pops up?

python3.8 -m venv epinano12_py38  # python3.8.5
source epinano12_py38/bin/activate
pip install pandas
pip install dask==2.5.2
python -m pip install "dask[dataframe]"
pip install pysam

I tried the above and succeeded.
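
For reference, here is a minimal sketch (hypothetical code, not from EpiNano) of what is going on: before Python 3.8, multiprocessing packed every message length with struct.pack("!i", n), so a single queue item whose pickle exceeds 2**31 - 1 bytes triggers exactly the struct.error above. Python 3.8 switched to an 8-byte length header for large messages.

import multiprocessing as mp

# Hypothetical demo, not EpiNano code: a Manager queue pickles each item and
# sends it over a connection. On Python < 3.8 the message length is packed as
# a signed 32-bit int, so one item over 2**31 - 1 bytes raises struct.error.
def worker(q):
    q.put(bytes(2**31))  # pickled payload is just over 2 GiB

if __name__ == "__main__":
    with mp.Manager() as manager:
        q = manager.Queue()
        p = mp.Process(target=worker, args=(q,))
        p.start()
        p.join()

On 3.8+ the same script completes, which is why upgrading makes the error disappear without shrinking num_reads_per_chunk.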

In the meantime, you can split your bam file by reference ID and run the commands in parallel:

samtools view -hb your.bam refid1 > refid1.bam
samtools view -hb your.bam refid2 > refid2.bam
...

This way, with parallel commands and multiple processes, it should speed up the analysis quite a lot.
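
If you want to script the whole loop, here is a sketch (the Epinano_Variants.py flags -n, -R, -b and -s follow the EpiNano 1.2 examples, but verify them against Epinano_Variants.py -h for your version; the paths and the refid*.bam naming are assumptions):

import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_variants(bam):
    # Index the split bam, then call variants on it.
    subprocess.run(["samtools", "index", bam], check=True)
    subprocess.run(
        ["python", "Epinano_Variants.py",
         "-n", "2", "-R", "reference.fasta",
         "-b", bam, "-s", "/path/to/sam2tsv.jar"],
        check=True,
    )

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() drains the iterator so any failed job raises here.
        list(pool.map(run_variants, sorted(glob.glob("refid*.bam"))))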

Celinet21 commented 4 years ago

I was previously using a Python 3 version < 3.8, and with >= 3.8 the error doesn't show up at all, thank you!

And thank you for the suggestion to speed things up by splitting the bam file. So the idea is to do that, index each file after splitting, run Epinano_Variants.py on each split bam file in parallel, and then at the end combine all of the resulting xmer.csv files into one.

Or I suppose I should continue with Epinano_Predict.py before combining. Then, finally, I can combine everything at the end with no issues?

I'm also wondering whether you have a recommendation for how much memory to give EpiNano when running it. I also see the example commands pass 6 threads as an argument.

Thanks!

Celinet21 commented 4 years ago

FYI, a couple of my tmp files created by Epinano_Variants (e.g. ... .per_site_var.5mer.tmp) contain a line like ACAGN,17488366-17488370,+2,Null,

which causes an index-out-of-range error at:

File "epinano_modules.py", line 869, in slide_per_site_var
    window = (ary[0], ary[1], ary[3], ary[6])       
IndexError: list index out of range
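
A possible workaround, as a defensive sketch rather than EpiNano's own fix: the unpack at line 869 needs at least seven comma-separated fields (it reads ary[6]), while the ACAGN line above has only four, so malformed rows can be skipped before the window is built.

# Sketch, not EpiNano code: yield only well-formed rows from a
# .per_site_var.5mer.tmp file before the slide_per_site_var unpack.
def iter_wellformed(path):
    with open(path) as fh:
        for line in fh:
            ary = line.rstrip("\n").split(",")
            if len(ary) < 7:
                continue  # e.g. "ACAGN,17488366-17488370,+2,Null"
            yield (ary[0], ary[1], ary[3], ary[6])  # fields the window uses
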
Huanle commented 4 years ago

> I was previously using a Python 3 version < 3.8, and with >= 3.8 the error doesn't show up at all, thank you!
>
> And thank you for the suggestion to speed things up by splitting the bam file. So the idea is to do that, index each file after splitting, run Epinano_Variants.py on each split bam file in parallel, and then at the end combine all of the resulting xmer.csv files into one.
>
> Or I suppose I should continue with Epinano_Predict.py before combining. Then, finally, I can combine everything at the end with no issues?
>
> I'm also wondering whether you have a recommendation for how much memory to give EpiNano when running it. I also see the example commands pass 6 threads as an argument.
>
> Thanks!

You can do the combination either before or after making predictions; both are fine. As for memory usage, the most RAM-consuming step should be combining all the small chunks in the tmp folder. This is done with dask, which handles large files well through its lazy-loading trick. But honestly, I have not tested it extensively enough to determine the relationship between input size and memory requirements.
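
For the combining step itself, a minimal sketch (the refid*.per_site_var.5mer.csv naming is only an assumption about how the per-reference outputs were saved):

import glob
import pandas as pd

# Stack the per-reference 5-mer tables into one CSV with a single header row.
frames = [pd.read_csv(f) for f in sorted(glob.glob("refid*.per_site_var.5mer.csv"))]
pd.concat(frames, ignore_index=True).to_csv("combined.per_site_var.5mer.csv", index=False)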

That said, I have attached the memory profiles for processing reads mapped to one chromosome versus all chromosomes (16). You can see the difference is obvious.
The number of threads used was 8.

[Figure: Supp-Fig1-a, resource consumption when processing one chromosome]

[Figure: Supp-Fig1-b, resource consumption when processing all chromosomes]

Hope this helps.

Celinet21 commented 4 years ago

Thanks a lot @Huanle!! Very helpful, and everything seems to be working and producing results thanks to your help.

I might open another issue for another error I'm getting, though. I'm not sure whether it's a real issue or something I can just ignore.