mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License

out of memory #67

Closed: jdmontenegro closed this issue 6 years ago

jdmontenegro commented 6 years ago

Hi guys, first of all, thank you for this great software, it looks really impressive. I would like to know if there is any workaround for an out-of-memory fatal error while threading the DBG. I have a 1.7 Gbp genome sequenced to a coverage of ~60X. I am using the largest node I have available, which has 1.5 TB of RAM, but still get the following error (I've removed full paths to make it simpler to read):

./mccortex63 thread -m 1450G -t 128 -1 merged_ec.1.fastq -1 merged_ec.2.fastq -o nigel_links.se.ctp.gz nigel_dbg.popped.ctx
[19 Mar 2018 21:59:28-qak][cmd] /data/Bioinfo/bioinfo-proj-jmontenegro/Programs/mccortex/bin/mccortex63 thread -m 1450G -t 128 -1 merged_ec.1.fastq -1 merged_ec.2.fastq -o nigel_links.se.ctp.gz nigel_dbg.popped.ctx
[19 Mar 2018 21:59:28-qak][cwd] /data/Bioinfo/bioinfo-proj-jmontenegro/Programs/mccortex/bin
[19 Mar 2018 21:59:28-qak][version] mccortex=v0.0.3-554-ga7d6f3b zlib=1.2.3 htslib=1.3.2-208-gd8d0323 ASSERTS=ON hash=Lookup3 CHECKS=ON k=33..63
[19 Mar 2018 21:59:28-qak] Reading graph: nigel_dbg.popped.ctx
[19 Mar 2018 21:59:28-qak][task] input: merged_ec.1.fastq
[19 Mar 2018 21:59:28-qak]  FASTQ offset: auto-detect, threshold: off; cut homopolymers: off
[19 Mar 2018 21:59:28-qak]  one-way gap traversal [no edge check]
[19 Mar 2018 21:59:28-qak][task] input: merged_ec.2.fastq
[19 Mar 2018 21:59:28-qak]  FASTQ offset: auto-detect, threshold: off; cut homopolymers: off
[19 Mar 2018 21:59:28-qak]  one-way gap traversal [no edge check]
[19 Mar 2018 21:59:28-qak][memory] 456 bits per kmer
[19 Mar 2018 21:59:28-qak][memory] graph: 85.6GB
[19 Mar 2018 21:59:28-qak][memory] paths hash: 415.3GB
[19 Mar 2018 21:59:28-qak][memory] paths store: 949.2GB
[19 Mar 2018 21:59:28-qak][memory] total: 1.4TB of 1.5TB RAM
[19 Mar 2018 21:59:28-qak] Creating paths file: nigel_links.se.ctp.gz
[19 Mar 2018 21:59:28-qak][hasht] Allocating table with 1,610,612,736 entries, using 24.1GB
[19 Mar 2018 21:59:28-qak][hasht]  number of buckets: 33,554,432, bucket size: 48
[19 Mar 2018 21:59:28-qak][graph] kmer-size: 47; colours: 1; capacity: 1,610,612,736
[19 Mar 2018 21:59:28-qak][GPathSet] Allocating for 31,446,338,337 paths, 29.3GB colset, 351.4GB seq => 937.2GB total
[19 Mar 2018 21:59:28-qak][GPathHash] Allocating table with 45,097,156,608 entries, using 420.1GB
[19 Mar 2018 21:59:28-qak][GPathHash]  number of buckets: 1,073,741,824, bucket size: 42
[19 Mar 2018 22:02:47-qak] Not using new paths as they are added (safe)
[19 Mar 2018 22:02:47-qak][FileFilter] Reading file nigel_dbg.popped.ctx [1 src colour]
[19 Mar 2018 22:02:47-qak][GReader] 1,199,356,668 kmers, 23.5GB filesize
[19 Mar 2018 22:12:06-qak][GReader] Loaded 1,199,356,668 / 1,199,356,668 (100.00%) of kmers parsed
[19 Mar 2018 22:12:06-qak][hasht] buckets: 33,554,432 [2^25]; bucket size: 48;
[19 Mar 2018 22:12:06-qak][hasht] memory: 24.1GB; filled: 1,199,356,668 / 1,610,612,736 (74.47%)
[19 Mar 2018 22:12:06-qak][GPathStore] Creating separate read/write GraphPath linked lists
[19 Mar 2018 22:12:06-qak][asyncio] Inputs: 2; Threads: 128
[19 Mar 2018 22:12:06-qak][seq] Parsing sequence file merged_ec.1.fastq
[19 Mar 2018 22:12:06-qak][seq] Parsing sequence file merged_ec.2.fastq
[19 Mar 2018 22:12:06-qak] merged_ec.2.fastq: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [43,73]]
[19 Mar 2018 22:12:07-qak] merged_ec.1.fastq: Qual scores: Sanger (Phred+33) [offset: 33, range: [33,73], sample: [44,73]]
[19 Mar 2018 22:18:46-qak][GenPaths] Read 5,000,000 entries (reads / read pairs)
(...)
[20 Mar 2018 03:14:35-qak][GenPaths] Read 395,000,000 entries (reads / read pairs)
[20 Mar 2018 03:16:33-qak][GPathSet] Paths: 19,725,722,745 / 31,446,338,337 [62.73%], seqs: 380.7GB / 380.7GB [100.00%] (408802399165 / 408802398397)
[20 Mar 2018 03:16:33-qak] 19725722744 >= 31446338337; 408802398561 > 408802398397 nbytes: 17
[20 Mar 2018 03:16:33-qak][gpath_set.c:203] Fatal Error: Out of memory

From the log, it looks like it should fit in memory, but it doesn't. Would it help if I reduced the number of reads to half (~30X) to reduce memory usage? Or if I split the fastq into two chunks and did a two-step threading of the SE information to reduce peak memory? Or would the only solution be to ask for a larger node (more memory)?

I look forward to any suggestion.

Kind regards,

jdmontenegro commented 6 years ago

Hi guys, as a follow-up: using only half the reads in the SE threading step helped reduce the memory footprint, and the program is currently writing the SE.thread.ctp.gz file as we speak. I'm not sure how well memory is going to be managed for the PE threading.

I also saw it is possible to use PacBio reads for SE threading. Would these be uncorrected/raw PacBio reads, or do they need to be corrected before using them?

Regards,

winni2k commented 6 years ago

You could try using cortexjdk for this. Cortexjdk uses binary searches on the cortex graph for all operations and so does not need to load the graph into memory. You might have to contact Kiran to get him to expose that functionality to the command line, but I'm pretty sure the functionality is already there...

noporpoise commented 6 years ago

Are your reads very long? Did you clean the graph (.ctx file) before threading the reads? Unfortunately the graph threading step does require a large amount of memory. Writing links to disk instead of storing them in memory would fix this issue - it sounds like cortexjdk might do that.

Cleaning the de Bruijn graph or cleaning more aggressively may fix these issues.
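For reference, a minimal sketch of that clean-then-thread workflow (a suggestion only: memory and thread values are placeholders, output file names are made up, and the cleaning thresholds are left at the clean command's defaults, so check its usage output for the options that suit your coverage):

  # clip tips / drop low-coverage unitigs before threading (illustrative settings)
  ./mccortex63 clean -m 200G -t 16 -o nigel_dbg.clean.ctx nigel_dbg.popped.ctx
  # thread reads against the cleaned, smaller graph
  ./mccortex63 thread -m 1450G -t 128 -1 merged_ec.1.fastq -1 merged_ec.2.fastq -o nigel_links.se.ctp.gz nigel_dbg.clean.ctx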

winni2k commented 6 years ago

@jdmontenegro, to answer your question about threading of PacBio reads: my understanding is that you should not have to correct the PacBio reads before threading. In essence, threading is a way of correcting PacBio reads against Illumina data. My understanding is also that mccortex will fill in gaps in the PacBio reads based on the graph if there are no junctions in that region of the graph.

Do make sure you pick a k-mer size that is appropriate for your type of PacBio sequencing. For example, k=47 with PacBio reads is only going to work with CCS reads.
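For illustration, a hedged sketch of what that would look like with the same thread command used earlier in this issue (the CCS file name, output name, and memory value are placeholders):

  # thread PacBio CCS reads as single-end input against the existing graph
  ./mccortex63 thread -m 1450G -t 128 -1 pacbio_ccs.fastq -o nigel_links.pb.ctp.gz nigel_dbg.popped.ctx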

jdmontenegro commented 6 years ago

Hi everyone, after extensive correspondence with Kiran Garimella, who has been extremely helpful, a few things are clear:

1) It is possible to divide the dataset into smaller chunks to reduce memory usage during threading.
2) Threading with PacBio reads requires corrected reads; this step is not intended for error correction but for cleaning up the de Bruijn graph and improving the assembly.

BTW, I'm definitely going to try cortexjdk, it seems it will be very helpful in this case.

Thank you for all your answers.

Regards,

Juan D. Montenegro

noporpoise commented 6 years ago

Hi Juan,

Yes, you can thread a few reads at a time (using mccortex31 thread) and then merge the resulting .ctp files with mccortex31 pjoin:

  1. Partition reads into several files reads1.fa, ..., readsN.fa
  2. mccortex thread -1 reads1.fa -o reads1.ctp graph.ctx
  3. mccortex thread -1 reads2.fa -o reads2.ctp graph.ctx
  4. mccortex pjoin -o allreads.ctp reads1.ctp reads2.ctp ... readsN.ctp

We don't need to store all links in memory while threading, but unfortunately this implementation does. So we can safely do the threading step in stages and merge the results.
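As a concrete sketch of one way to script those steps (the chunk size, memory setting, and file names are placeholders, and GNU split is just one way to partition a FASTQ; any splitter that keeps 4-line records intact will do):

  # partition the reads into chunks of 25M records (100,000,000 lines = 25M reads at 4 lines per record)
  split -l 100000000 -d --additional-suffix=.fastq merged_ec.1.fastq chunk_
  # thread each chunk independently against the same graph
  for f in chunk_*.fastq; do
    ./mccortex63 thread -m 700G -t 128 -1 "$f" -o "links.$f.ctp.gz" nigel_dbg.popped.ctx
  done
  # merge the per-chunk link files into a single .ctp.gz
  ./mccortex63 pjoin -o nigel_links.se.ctp.gz links.chunk_*.ctp.gz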