rvolden / Mandalorion-Episode-II

Version II of Mandalorion
MIT License

MemoryError during defineAndQuantifyWrapper.py #6

Open ljw90607 opened 6 years ago

ljw90607 commented 6 years ago

Dear @christopher-vollmers ,

Hello. I have been trying to do a test run of your script using the example you provided. Unfortunately, I am getting the error shown below, and I cannot find a solution anywhere.

Traceback (most recent call last):
  File "defineAndQuantifyIsoforms.py", line 245, in <module>
    main()
  File "defineAndQuantifyIsoforms.py", line 242, in main
    define_start_end_sites(start_end_dict, individual_path, subreads)
  File "defineAndQuantifyIsoforms.py", line 132, in define_start_end_sites
    positions = np.array(start_end_dict[identity])
MemoryError

Traceback (most recent call last):
  File "createConsensi.py", line 212, in <module>
    for line in open(path + '/isoform_list'):
FileNotFoundError: [Errno 2] No such file or directory: '/data/ONT/Data/PM-AU-0002-N-A1//isoform_list'

I did some research and found that this might be caused by a 32-bit build of Python. However, we have 64-bit Python installed on our system. Do you have any suggestions on how I could overcome this issue?
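For anyone wanting to confirm the same thing, a minimal sketch for checking whether the running interpreter is a 32-bit or 64-bit build is shown here. The function name `pointer_bits` is made up for illustration; it relies only on the standard `struct` and `sys` modules.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # On a 64-bit build, sys.maxsize is larger than 2**32.
    print(f"{pointer_bits()}-bit Python, sys.maxsize = {sys.maxsize}")
```

Run it with the same `python3` that launches the wrapper script, since several interpreters may be installed side by side.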

If you could help me with it, I would really appreciate it!

Thank you!

Jungwoo

ljw90607 commented 6 years ago

@rvolden If you could help me with this issue, I would really appreciate it!

rvolden commented 6 years ago

Can I have a little more info on the system you're running this on? Since your Python build is for the correct architecture, a memory error means Python is trying to allocate more memory than the system has available. Are you running this locally or on a server, and how much RAM do you have?

ljw90607 commented 6 years ago

To @rvolden, I ran it on a Linux server with 264568344 kB of total RAM and 68946012 kB of free memory at the moment. I previously looked for a solution to this issue and found some comments suggesting that the input might be too big. But I do not think the input is that large compared to other datasets (the SAM file is around 7 GB).

The command I ran is shown below:
python3 defineAndQuantifyWrapper.py -c /data/ONT_RNA/analysis/mandalorion/defineAndQuantifyIsoforms/20180918_PM_AU_samples.txt -f /data/ONT_RNA/analysis/mandalorion/defineAndQuantifyIsoforms/contig_file.txt -p /data/ONT_RNA/Data/PM-AU-0002-N-A1/ -m NUC.4.4.mat -u 5 -d 30 -s 200 -g /data/ONT_RNA/reference/gencode.v28.annotation.gtf -r 0.05 -R 3 -i 0 -t 0 -I 100 -T 60

Thank you very much for your help.

Jungwoo

rvolden commented 6 years ago

We found out what the issue is and should be pushing a fix for it soon

ljw90607 commented 6 years ago

Dear @rvolden,

Thank you very much for your support. I'm really looking forward to hearing back from you soon!

Jungwoo

ljw90607 commented 6 years ago

Dear @rvolden,

Has the issue you found been fixed? I haven't heard anything about it for a while, so I was wondering. Thank you for your support!

rvolden commented 6 years ago

I think this issue has been fixed, please let us know if it's still occurring after updating

ljw90607 commented 6 years ago

Dear @rvolden,

Thank you for your wonderful help. This time the memory issue did not come up, but I still get the error message below:

Traceback (most recent call last):
  File "createConsensi.py", line 212, in <module>
    for line in open(path + '/isoform_list'):
FileNotFoundError: [Errno 2] No such file or directory: '/data/ONT_RNA/Data/PM-AU-0002-N-A1/isoform_list'

I checked each of the Python scripts to see where isoform_list is supposed to be created, but could not find it.

I used the same command as before. The error message came up after this processing log:
14_5l8458-3r8401~5l8469-3r8409~5l8456-3r8471~ 1
22_5l18590-3r18418~5l18625-3r18449~5l18582-3r18410~5l18589-3r18412~5l18656-3r18454~ 2
2_5l997-3r1042~3l1152-5r1171~ 11
GL000216.2_5l19146-3r18942~ 1

Thank you again for your wonderful help. Please let me know if you need more information about it.

Jungwoo

rvolden commented 6 years ago

Can you provide me with the command you used to run this? Including the paths provided

ljw90607 commented 6 years ago

Dear @rvolden,

I am actually trying out a few different commands, since I found out that the "isoform" output was created in the wrong folder. I will let you know how the output comes out. Thank you for your wonderful help.

ljw90607 commented 6 years ago

Dear @rvolen,

After I tried with new input, I ran into another problem, shown below:

/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860_subreads.fastq 1_3l5571-5r5485~3l5577-5r5486~_165631238_165651231_22.0_81.0

/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860_subreads.fastq 1_3l5571-5r5485~3l5577-5r5486~_165631238_165651231_22.0_81.0
determine
Traceback (most recent call last):
  File "createConsensi.py", line 218, in <module>
    corrected_consensus, repeats = determine_consensus(name, fasta, fastq)
  File "createConsensi.py", line 147, in determine_consensus
    min(len(fastq_reads), subsample), replace=False)
  File "mtrand.pyx", line 1126, in mtrand.RandomState.choice
ValueError: a must be non-empty

Here is the command, including the paths, followed by the input file contents.
Command used: python3 defineAndQuantifyWrapper.py -c /data/ONT_RNA/analysis/mandalorion/defineAndQuantifyIsoforms/20180918_PM_AU_samples.txt -f /data/ONT_RNA/analysis/mandalorion/defineAndQuantifyIsoforms/config_file.txt -p /data/ONT_RNA/Data/PM-AU-0002-N-A1/ -m NUC.4.4.mat -u 5 -d 30 -s 200 -g /data/ONT_RNA/reference/gencode.v28.annotation.gtf -r 0.05 -R 3 -i 0 -t 0 -I 100 -T 60

20180918_PM_AU_samples.txt : /data/ONT_RNA/Data/PM-AU-0002-N-1/20180817_colon_2N_minimap2.sorted.psl
/data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N.fasta
/data/ONT_RNA/Data/PM-AU-0002-N-A1/
/data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N.fastq
/data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N_minimap2.sam

(tab separated)

config.file.txt :
minimap2 /data/ONT_RNA/download/minimap2/minimap2
racon /appl/racon/racon/build/bin
consensus /data/ONT_RNA/download/mandalorion/Mandalorion-Episode-II-master
poa /appl/poaV2/bio-pipeline-master/poaV2/poa
water /usr/local/bin/water
blat /appl/blat/blatSrc/bin/blat

Output created: parsed_reads/ SS.bed isoform_list

If you need more information for figuring out this issue, please let me know.

ljw90607 commented 6 years ago

Dear @rvolden, I was wondering if you could help me with the issue above. Thank you for your help.

rvolden commented 6 years ago

Sorry, I'm guessing I wasn't notified of the previous post because my username was misspelled. Looking at the error, it isn't getting a length for the fastq read list. Can you give me the contents of your isoform_list file please? If that's empty, then something is going wrong in defineAndQuantifyIsoforms.py.

Other things that I see from your config file: racon and consensus should each point to the executable or script itself, not just the directory containing it. For consensus, the line should be consensus /data/ONT_RNA/download/mandalorion/Mandalorion-Episode-II-master/consensus.py. For racon, it should be racon /appl/racon/racon/build/bin/racon.
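A quick way to catch this kind of mistake is to scan the config file for entries that point at a directory or at a missing path. The following is only a sketch: the helper `find_bad_config_entries` is hypothetical and not part of Mandalorion, and it assumes each config line holds exactly two whitespace-separated fields (tool name, then path), as in the config shown in this thread.

```python
import os


def find_bad_config_entries(config_path):
    """Return (name, path, reason) for entries that do not point at a file.

    Assumes one "name path" pair per line, separated by whitespace.
    """
    problems = []
    with open(config_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or unexpected lines
            name, target = fields
            if os.path.isdir(target):
                problems.append((name, target, "is a directory, not a file"))
            elif not os.path.isfile(target):
                problems.append((name, target, "does not exist"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage (path is whatever was passed to the wrapper's -f option):
    #   python3 check_config.py config_file.txt
    for name, target, reason in find_bad_config_entries(sys.argv[1]):
        print(f"{name}: {target} {reason}")
```

With the config posted above, this would be expected to flag the racon and consensus lines, since both point at directories.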

ljw90607 commented 5 years ago

Dear @rvolden,

The isoform_list file is not empty. I have also fixed the config file as you suggested, but I still get the error below.

/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/36/Isoform146777.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/36/Isoform146777_subreads.fastq 12__7107602_7108479_97.0_138.0

/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/36/Isoform146777.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/36/Isoform146777_subreads.fastq 12__7107602_7108479_97.0_138.0
determine
Traceback (most recent call last):
  File "createConsensi.py", line 218, in <module>
    corrected_consensus, repeats = determine_consensus(name, fasta, fastq)
  File "createConsensi.py", line 147, in determine_consensus
    min(len(fastq_reads), subsample), replace=False)
  File "mtrand.pyx", line 1126, in mtrand.RandomState.choice
ValueError: a must be non-empty

Here is the content of the isoform_list file:
/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860_subreads.fastq 1_3l5571-5r5485~3l5577-5r5486~_165631238_165651231_22.0_81.0
/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/33/Isoform132094.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/33/Isoform132094_subreads.fastq 646839269_46839551_358.5_44.0
/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/62/Isoform251429.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/62/Isoform251429_subreads.fastq 2059033437_59034830_81.0_80.0
/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/34/Isoform139658.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/34/Isoform139658_subreads.fastq 12_5l4355-3r4283~5l4344-3r4278~_71125089_71132799_83.5_18.0
/data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/20/Isoform83933.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/20/Isoform83933_subreads.fastq 1_5l5732-3r5637~5l5754-3r5668~5l5729-3r5634~_150732607_150751870_522.0_83.0
... and more.

Since you mentioned that it was not getting the length from the fastq file, I looked into the fastq file and found that it contains no length information. It seems no length information was included in the raw fastq file when we processed it. Could this be the issue? If so, is there a way to add length information for each read?

Thank you again for your wonderful help.

ljw90607 commented 5 years ago

@rvolden, another thing I noticed is that all of the subreads.fastq output files are empty, and only the fasta files have some content. I am not sure whether this could be the issue. Thank you for your help again.

rvolden commented 5 years ago

What I mean by not getting length information is that it isn't getting the reads. It's trying to take a subset of the reads with np.random.choice. This error can be reproduced with np.random.choice(np.arange(0, 0), 0, replace=False). This will only occur if the range of numbers it's given is empty (np.arange(0, len(fastq_reads))), so fastq_reads has to have a length of zero. This means when it's reading in the file and giving back the sequences, it's reading and giving back nothing. That means /data/ONT_RNA/Data/PM-AU-0002-N-A1//parsed_reads/44/Isoform177860_subreads.fastq is empty, and you noted that all of your subread files are empty.
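To make this failure easier to diagnose, the subsampling step could check for an empty read list before calling NumPy. The sketch below is an illustration, not the code in createConsensi.py: the function name `subsample_indices` and its `seed` parameter are assumptions, and only the `choice` call mirrors the one in the traceback. The explicit check also avoids relying on the library error, which may differ between NumPy versions.

```python
import numpy as np


def subsample_indices(n_reads, subsample, seed=None):
    """Pick up to `subsample` distinct read indices out of `n_reads`.

    Hypothetical helper: raises a descriptive error when there are no
    reads, instead of the terse "a must be non-empty".
    """
    if n_reads == 0:
        raise ValueError(
            "read list is empty; check that the subreads FASTQ has content"
        )
    rng = np.random.RandomState(seed)
    # Same pattern as the call in the traceback: sample without replacement.
    return rng.choice(np.arange(0, n_reads), min(n_reads, subsample),
                      replace=False)


if __name__ == "__main__":
    print(subsample_indices(10, 3, seed=0))
```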

This might have happened because you don't have a fastq file in your content file. Check your content file to ensure you have your fastq file in there, which should be the fourth tab separated item.
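A small check along these lines can confirm that the fourth tab-separated item of a content-file line names a FASTQ file that exists and has content. This is a sketch: `check_content_line` is a hypothetical helper, and the only layout it assumes is the one described above (FASTQ path in the fourth tab-separated field).

```python
import os


def check_content_line(line):
    """Return a problem description for one content-file line, or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return f"expected at least 4 tab-separated fields, found {len(fields)}"
    fastq = fields[3]  # fourth tab-separated item
    if not os.path.isfile(fastq):
        return f"FASTQ not found: {fastq}"
    if os.path.getsize(fastq) == 0:
        return f"FASTQ is empty: {fastq}"
    return None


if __name__ == "__main__":
    import sys

    # Usage: python3 check_content.py <content file passed via -c>
    with open(sys.argv[1]) as handle:
        for number, line in enumerate(handle, start=1):
            problem = check_content_line(line)
            if problem:
                print(f"line {number}: {problem}")
```

A line that is separated by spaces instead of tabs shows up here as a field-count problem, which is an easy mistake to miss by eye.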

ljw90607 commented 5 years ago

Dear @rvolden, I double checked the content file and the fastq file does exist in the right place.

"" /data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N_minimap2.sorted.psl /data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N.fasta /data/ONT_RNA/Data/PM-AU-0002-N-A1/ /data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N.fastq /data/ONT_RNA/Data/PM-AU-0002-N-A1/20180817_colon_2N_minimap2.sam ""

Would there be any other reason why this error could occur?

ljw90607 commented 5 years ago

To @rvolden,

When we used a different dataset to run Mandalorion, it still did not give us the subreads.fastq output. Is there a test dataset we could use to check that the script runs properly?

wiedemak commented 5 years ago

Hey @rvolden and @ljw90607,

I had the same problem of empty Isoform*_subreads.fastq files and found a cause. In my case, each read in the input *.fastq file consists of four lines like these:

@83cd2a45-5e8b-4f19-a5ae-db4d09d80c32 runid=6b149e596f92e00e5f8e09d68bbf440b4c863b82 read=104 ch=15 start_time=2018-01-01T01:02:52Z
GTAGAGGATAGG[...]
+
)-2')%(')*(+[...]

In line 224 of defineAndQuantifyIsoforms.py, the root_name of the read is parsed with root_name = name[1:].split('_')[0]. However, my read headers have no underscore after the read ID; the ID is followed by a space. Splitting on a space instead, root_name = name[1:].split(' ')[0], gives correct root_name values, the read_seq/subreads list is filled with more than just read names, and the Isoform*_subreads.fastq files are no longer empty.
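The difference between the two split calls can be seen directly on the header shown above. This snippet only uses the two expressions quoted from the script and the example header; the variable names are for illustration.

```python
# Header line from the FASTQ record shown above.
header = ("@83cd2a45-5e8b-4f19-a5ae-db4d09d80c32 "
          "runid=6b149e596f92e00e5f8e09d68bbf440b4c863b82 "
          "read=104 ch=15 start_time=2018-01-01T01:02:52Z")

# Original parsing: everything up to the first underscore. The only
# underscore in this header is the one inside "start_time", so almost
# the whole header is kept.
by_underscore = header[1:].split('_')[0]

# Modified parsing: everything up to the first space, i.e. the read ID.
by_space = header[1:].split(' ')[0]

print(repr(by_underscore))
print(repr(by_space))
```

With the underscore split, root_name runs on through "ch=15 start", so it would presumably never match a read name recorded as the bare ID, which fits the empty subread files.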

I hope this helps you, too.

rvolden commented 5 years ago

If that's the reason, then thank you. Unfortunately I don't have a test set to give. Both mine and Chris's versions of Mandalorion are designed to take R2C2 reads, which is why we split with an underscore rather than a space. Because of this I'm not sure what I should do in terms of support for both especially because I can't make changes to Chris's version.
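One possible way to handle both header styles, offered only as a sketch and not as a change to either version of Mandalorion, is to take the first whitespace-delimited token and then split that token on the underscore. The helper name `root_name_from_header` is hypothetical, and the R2C2-style header in the example is made up for illustration; it assumes that the root name precedes the first underscore and contains no whitespace.

```python
def root_name_from_header(header):
    """Extract a read's root name from a FASTQ header line.

    Takes the first whitespace-delimited token after '@', then keeps the
    part before the first underscore.
    """
    first_token = header[1:].split()[0]
    return first_token.split('_')[0]


if __name__ == "__main__":
    # Header style from this thread (read ID followed by a space).
    print(root_name_from_header(
        "@83cd2a45-5e8b-4f19-a5ae-db4d09d80c32 runid=abc start_time=x"))
    # Made-up underscore-delimited header, standing in for R2C2 reads.
    print(root_name_from_header("@readA_25.3_1500"))
```

The trade-off is that a non-R2C2 read ID which itself contains an underscore would be truncated, so this is not a general solution.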