Closed: Celinet21 closed this issue 4 years ago
Hi Celinet21,
I ran into this error only once.
It is related to Python's pickle-based communication protocol between processes.
Python 3.8 seems to have solved this issue; have a look at this.
Can you give it a go with Python 3.8 and let me know if this error still pops up?
python3.8 -m venv epinano12_py38 # python3.8.5
source epinano12_py38/bin/activate
pip install pandas
pip install dask==2.5.2
python -m pip install "dask[dataframe]"
pip install pysam
I tried the above and succeeded.
In the meanwhile, you can split your BAM file on reference IDs and run the commands in parallel.
samtools view -hb your.bam refid1 > refid1.bam
samtools view -hb your.bam refid2 > refid2.bam
...
This way, with parallel commands across multiple processes, the analysis should speed up quite a lot.
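As a sketch of that idea (the reference IDs `refid1`/`refid2`/`refid3` are placeholders; in practice you could list them with `samtools idxstats your.bam | cut -f1`), you could generate one split-and-index command per reference and then feed them to `xargs -P` to run several at once:

```shell
# Build one samtools command per reference ID (placeholder IDs here).
# Writing them to a file lets xargs run them in parallel afterwards.
: > split_cmds.txt
for ref in refid1 refid2 refid3; do
  echo "samtools view -hb your.bam $ref > $ref.bam && samtools index $ref.bam" >> split_cmds.txt
done
cat split_cmds.txt
# To actually run them, e.g. 6 at a time:
#   xargs -P 6 -I CMD sh -c CMD < split_cmds.txt
```

Each split BAM then gets its own index, so Epinano_Variants can be started on each one independently.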
I was previously using a Python 3 version <3.8, and with >=3.8 that error didn't show up at all, thank you!
And thank you for that suggestion for speeding things up by splitting the BAM file. So the idea is to do that, index each file after splitting, run Epinano_Variants.py on each split BAM file in parallel, then at the end combine all the resulting xmer.csv files into one.
Or I suppose I should continue with Epinano_Predict.py before combining. Then finally, I can combine at the end with no issues?
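One way to combine the per-reference CSVs while keeping a single header is sketched below; the filenames and columns are made up for illustration, so adapt them to the actual Epinano_Variants output:

```shell
# Create two toy per-reference result files (hypothetical names/columns).
printf 'kmer,pos,cov\nACAGT,100,25\n' > refid1.5mer.csv
printf 'kmer,pos,cov\nTGACA,205,31\n' > refid2.5mer.csv
# Take the header once from the first file, then append only data rows.
head -n 1 refid1.5mer.csv > combined.5mer.csv
for f in refid1.5mer.csv refid2.5mer.csv; do
  tail -n +2 "$f" >> combined.5mer.csv
done
```

The header then appears exactly once, and each file contributes only its data rows.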
I also am wondering if you have suggestions on how much memory to give when running EpiNano? I also see the example commands pass 6 threads as an argument.
Thanks!
FYI, a couple of my tmp files created by Epinano_Variants (e.g. ... .per_site_var.5mer.tmp) have a line like:
ACAGN,17488366-17488370,+2,Null
This causes an index out of range error at:
File "epinano_modules.py", line 869, in slide_per_site_var
window = (ary[0], ary[1], ary[3], ary[6])
IndexError: list index out of range
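As a workaround sketch (not an EpiNano feature): since `slide_per_site_var` indexes up to `ary[6]`, one could drop comma-separated lines with fewer than 7 fields before they reach that step. The 7-field minimum is an assumption inferred from the traceback, and the well-formed example line below is made up; check your actual tmp file format first:

```shell
# Toy tmp file: first line is the malformed 4-field record from above,
# second is a made-up well-formed record with 7 comma-separated fields.
printf 'ACAGN,17488366-17488370,+2,Null\nACAGT,17488366,17488370,+,5,10,0.1\n' > per_site_var.5mer.tmp
# Keep only lines with at least 7 fields (threshold assumed from ary[6]).
awk -F',' 'NF >= 7' per_site_var.5mer.tmp > per_site_var.5mer.clean
cat per_site_var.5mer.clean
```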
You can do the combination either before or after making predictions; both are fine. As for memory usage, the most RAM-consuming step should be combining all the small chunks in the tmp folder. This is handled with dask, which copes with large files well through its lazy-loading trick. But honestly, I have not tested it extensively enough to determine the relationship between input size and memory requirements.
That said, I have attached here the memory profiles for processing reads mapped to one chromosome versus all chromosomes (16). You can see the difference is obvious.
The number of threads used was 8!
Hope this helps.
Thanks @Huanle a lot!! Very helpful, and everything seems to be working and producing results thanks to your help.
I might open another issue for this other error I'm getting, though; I'm not sure if it's a real issue or something I can just ignore.
I'm using long-read Oxford nanopore dRNA data. Epinano 1.2 svm, running Epinano_Variants.py.
I found I ran into a Python memory issue when using the default num_reads_per_chunk (even with the latest update of =1000). So I've changed it to 100 and no longer seem to see the error. The change I made was:
ps = Process (target = split_tsv_for_per_site_var_freq, args = (tsv_gen, q, number_threads, 100))
However, it's very slow. I just wanted to double-check whether what I did was a good or bad idea, and whether the slowness is unavoidable, or should I have done something else?
Thanks!!