MemoryError - Githubissues

moshl commented 5 years ago

Hi Blatte, I have AmpliSeq data from an Illumina platform and am trying to call ITDs with getITD, running this command:

$python3 $getitd -reference $itdref -anno $itdanno GSH \
    $I/Nova373-RDMT09-A144V1-PM2-20190920-E03_L3_U88V17_R1.fastq \
    $I/Nova373-RDMT09-A144V1-PM2-20190920-E03_L3_U88V17_R2.fastq

But now I run into this problem:

-- Reading FASTQ files
-- Reading FASTQ files took 65.20987079106271 s
Number of total reads: 3181110
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/multiprocessing/pool.py", line 429, in _handle_results
    task = get()
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/multiprocessing/connection.py", line 411, in _recv_bytes
    return self._recv(size)
  File "/p200/liuxin_group/moshl/software/python3/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
MemoryError

Please give me some advice. Thanks a lot. Mo

tjblaette commented 5 years ago

Hi Mo, sorry to get back to you only now. To be honest, I have never encountered this error myself. Do you get it for all of your samples or just this one? How much memory do you have available / what machine are you trying to run this on? How much memory is getITD using for this sample?
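If it helps to put a number on that last question, here is a rough sketch (Unix only) that runs a command and reports the peak memory of its child process. The command in the example is just a placeholder that allocates some memory, not a real getITD call, and the helper name is made up for illustration. Keep in mind that this reports the single largest process, not the sum across parallel workers, so it may underestimate total usage for a multiprocessing run.

```python
import resource
import subprocess
import sys


def peak_child_memory_mib(cmd):
    """Run cmd to completion and return the peak resident set size (MiB)
    of the largest waited-for child process."""
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Placeholder command; substitute the actual getITD call as a list of arguments.
peak = peak_child_memory_mib([sys.executable, "-c", "x = bytearray(50_000_000)"])
print("Peak child memory: {:.1f} MiB".format(peak))
```

Comparing that figure with the memory available on the machine (or the limit of the cluster job) should show whether the sample simply does not fit.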

Cheers, TJ

moshl commented 4 years ago

Thanks, I solved the "MemoryError". Can getITD call ITDs from IonTorrent amplicon data? In the workflow description you repeatedly mention "unique reads": does that mean deduplication? Amplicon data should not be deduplicated when detecting mutations, so how should I set the parameters to avoid this problem?

tjblaette commented 4 years ago

Hi Mo,

I'm glad you solved the memory problem. What exactly was causing the issue and how did you solve it? This could be useful for future reference for other users.

As for the IonTorrent data, I am not sure because I have never worked with it myself. I would give it a try. You might have to play with the minimum alignment score cutoff, if the error rate is much higher than for Illumina and if you accordingly see more alignments being discarded at this step. I am not sure about the adapter and primer filters, as this will depend on the library prep / design. You may need to set infer_sense_from_alignment to True. Maybe give it a try and let me know how it goes. I'll be happy to help with any troubleshooting.

Regarding the duplicates, this is not a problem. In fact, getITD was developed in conjunction with an amplicon assay so that works perfectly fine. Duplicate sequences are dealt with in two ways:

Unique reads, meaning non-duplicate sequences, are by default discarded (via the -min_read_copies parameter). We do this because we assume that any sequence that is true and clinically relevant will be present at least twice in a given amplicon-based sequencing sample. This is essentially the opposite of the typical "deduplication".
Duplicate reads are processed in groups to speed up the analysis. Most of our reads in an amplicon design will be duplicates of the WT sequence. Instead of processing / aligning each of these individually, producing the exact same result over and over again, only one read from each group of duplicates is processed once and the resulting alignment is saved for all of the reads with the same sequence. The duplicates are not discarded or "deduplicated" though. They are simply processed in a way that saves computation time.
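
As a rough illustration of these two points (this is not the actual getITD code; the function names and the toy alignment step are made up for the example), the idea can be sketched in Python like this:

```python
from collections import Counter


def group_reads(reads, min_read_copies=2):
    """Collapse identical sequences and drop those seen fewer than
    min_read_copies times. The copy number is kept, not thrown away."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_read_copies}


def process_groups(groups, align):
    """Call align once per distinct sequence and store the result
    together with the number of reads that share that sequence."""
    return [(seq, n, align(seq)) for seq, n in groups.items()]


reads = ["ACGT", "ACGT", "ACGT", "ACGA", "TTGA", "TTGA"]
groups = group_reads(reads)  # "ACGA" occurs only once and is filtered out

calls = []


def toy_align(seq):
    """Stand-in for an expensive alignment; records each call."""
    calls.append(seq)
    return len(seq)


results = process_groups(groups, toy_align)
total_reads_kept = sum(n for _, n, _ in results)
print(len(calls), "alignments for", total_reads_kept, "reads")
```

Here six input reads lead to only two alignment calls, while the five reads in the two surviving groups are all still counted. Passing min_read_copies=1 to the toy function keeps the singleton as well.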

Amplicon assays per se will thus work with default parameters. If, on the other hand, you have targeted enrichment data or reads of different lengths, parameters will need to be adjusted - in both cases, reads will mostly be "unique" and would be discarded using default parameters. Simply set -min_read_copies to 1 to prevent this. For the IonTorrent data you will have to test and see, as I have no experience with this. If you cannot get it to work at all, feel free to send me your getITD input files (small FASTQ, reference, annotation file, primer / adapter sequences) so that I can take a direct look myself.

Regards, TJ

moshl commented 4 years ago

Thank you so much, TJ. In fact, I don't know the primers of the IonTorrent amplicon; I just used the primers for PCR-CE, and the reference and annotation files are derived from the product of those primers. You mentioned setting the minimum alignment score cutoff; I do not know whether 20 is right or not. I am sharing the IonTorrent data. Below is my command:

/p200/liuxin_group/moshl/software/python3/bin/python3 /p200/liuxin_group/moshl/software/getitd-master/getitd.py AML184681 ../Data/AML184681_chr13.fq -technology 454 -min_bqs 20 -filter_ins_unique_reads 1 -min_read_copies 1 -reference /p200/liuxin_group/moshl/software/getitd-master/anno/iamplicon.txt -anno /p200/liuxin_group/moshl/software/getitd-master/anno/iamplicon_Ion_kayser.tsv -minscore_alignments 0.2 -infer_sense_from_alignment True -require_indel_free_primers false -forward_primer tcattattctttcctctatctgcagaact -reverse_primer gcaaacagtaaccattaaaaggatgga

However, I am still confused about the following:
① -require_indel_free_primers: I get different counts of the WT sequence with True and False. Why?
② How do I filter out false-positive ITDs?
③ I also ran Illumina data (PE150), and it always reports "NO READS TO PROCESS!". Why? I tried merging R1 and R2, but that doesn't work. I also tried setting a lower or higher minscore_alignments, which doesn't work either. P.S. The FASTQ files were converted from a mapped BAM file.

Thanks again! I look forward to your reply. Mo

AML184681_chr13.zip

tjblaette / getitd

MemoryError #1