nanoporetech / pomoxis

Analysis components from Oxford Nanopore Research

Running mini_assemble with large, high coverage fastq file #29

Closed avitalsteiman closed 5 years ago

avitalsteiman commented 5 years ago

Hi, I had a similar issue (I think it's similar). I ran mini_assemble on a very large, high-coverage fastq file. The script ran, but the final assembly output was an empty fasta file. Using top, I also noticed that despite passing -t 12 to mini_assemble, minimap2 was running on only one thread, and that the minimap2 command was issued with -t12 rather than -t 12 as described in the minimap2 manual.
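A note on the -t12 spelling: under the usual getopt convention, an option argument may be attached to the flag or given as a separate word, and both forms parse identically. The sketch below shows this with the shell's own getopts builtin. It is only an illustration of the convention, assuming minimap2 parses its options the same way; parse_threads is a hypothetical helper, not part of pomoxis or minimap2. The log in this thread points the same way, since the minimap2 runs started with -t12 report about 21.5 s of CPU time against about 5.1 s of real time.

```shell
# Toy parser (hypothetical): -t takes an argument, as declared by the "t:" spec.
parse_threads() {
    OPTIND=1    # reset so the function can be called more than once
    while getopts "t:" opt "$@"; do
        case "$opt" in
            t) echo "$OPTARG" ;;
        esac
    done
}

parse_threads -t12     # prints 12
parse_threads -t 12    # prints 12
```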

In addition, I noticed in the screen output that minimap2 was not returning any output, so I ran it independently and saw that it was killed due to lack of memory. I re-ran it with a smaller batch size using the -K parameter and it worked.
My question is: how can I do this with mini_assemble? Is it possible to control the minimap2 batch size? The command and screen output of my original mini_assemble run are below.

Thanks, Avital

(pomoxis) (base) biomesh@biomesh:~/fastq$ ../../../../../../usr/bin/time -v mini_assemble -i test3_filt_q80minLen500.fq -o assembledminLen500 -p test3_filt_q80minLen500_assm -t 12
Copying FASTX input to workspace: test3_filt_q80minLen500.fq > assembledminLen500/test3_filt_q80minLen500_assm.fa.gz
Skipped adapter trimming.
Skipped pre-assembly correction.
Overlapping reads...
[M::mm_idx_gen::12.4341.65] collected minimizers
[M::mm_idx_gen::14.3482.73] sorted minimizers
[M::main::14.3482.73] loaded/built the index for 211416 target sequence(s)
[M::mm_mapopt_update::14.7292.69] mid_occ = 5927
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 211416
[M::mm_idx_stat::14.9612.66] distinct minimizers: 25289437 (69.09% are singletons); average occurrences: 8.810; average spacing: 2.921
Assembling graph...
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.00021.43] read 0 hits; stored 0 hits and 0 sequences (0 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.00016.87] 0 query sequences remain after sub
[M::ma_hit_cut::0.00013.81] 0 hits remain after cut
[M::ma_hit_flt::0.00013.55] 0 hits remain after filtering; crude coverage after filtering: -nan
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.00012.93] 0 query sequences remain after sub
[M::ma_hit_cut::0.00012.57] 0 hits remain after cut
[M::ma_hit_contained::0.00012.26] 0 sequences and 0 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 0 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 0 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.3-r179
[M::main] CMD: miniasm -s 100 -e 3 -f test3_filt_q80minLen500_assm.fa.gz test3_filt_q80minLen500_assm.paf.gz
[M::main] Real time: 3.283 sec; CPU: 3.280 sec
Running racon read shuffle 1...
Running round 1 consensus...
[M::mm_idx_gen::0.0020.63] collected minimizers
[M::mm_idx_gen::0.0022.10] sorted minimizers
[M::main::0.0022.10] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.0022.09] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.0032.07] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.2793.98] mapped 162861 sequences
[M::worker_pipeline::5.1114.21] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 test3_filt_q80minLen500_assm.gfa.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.111 sec; CPU: 21.517 sec; Peak RSS: 0.629 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 2 consensus...
[M::mm_idx_gen::0.0003.03] collected minimizers
[M::mm_idx_gen::0.0014.88] sorted minimizers
[M::main::0.0014.86] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.0014.75] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.0014.65] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.2084.04] mapped 162861 sequences
[M::worker_pipeline::5.0064.28] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_1.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.006 sec; CPU: 21.434 sec; Peak RSS: 0.629 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 3 consensus...
[M::mm_idx_gen::0.0002.93] collected minimizers
[M::mm_idx_gen::0.0014.10] sorted minimizers
[M::main::0.0014.09] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.0013.99] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.0013.90] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.2554.05] mapped 162861 sequences
[M::worker_pipeline::4.8884.41] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_2.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 4.889 sec; CPU: 21.536 sec; Peak RSS: 0.628 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 4 consensus...
[M::mm_idx_gen::0.0003.57] collected minimizers
[M::mm_idx_gen::0.0016.03] sorted minimizers
[M::main::0.0016.00] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.0015.82] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.0015.61] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.2953.96] mapped 162861 sequences
[M::worker_pipeline::5.1554.18] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_3.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.156 sec; CPU: 21.538 sec; Peak RSS: 0.628 GB
[racon::Polisher::initialize] error: empty target sequences set!
Waiting for cleanup.
rm: cannot remove 'shuffled': No such file or directory
rm: cannot remove 'paf*': No such file or directory
Final assembly written to assembledminLen500/test3_filt_q80minLen500_assm_final.fa. Have a nice day.
Command being timed: "mini_assemble -i test3_filt_q80minLen500.fq -o assembledminLen500 -p test3_filt_q80minLen500_assm -t 12"
User time (seconds): 20353.59
System time (seconds): 53.28
Percent of CPU this job got: 1082%
Elapsed (wall clock) time (h:mm:ss or m:ss): 31:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 63910484
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 14171
Minor (reclaiming a frame) page faults: 25610878
Voluntary context switches: 220518
Involuntary context switches: 2653733
Swaps: 0

cjw85 commented 5 years ago

Hi @avitalsteiman

I have pushed a change that adds a -K option to the mini_assemble program. This option is passed to all calls of minimap2.
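For context, the standalone minimap2 run with a reduced batch size that was described earlier in the thread might look roughly like the sketch below. In minimap2, -K sets the number of bases loaded into memory per mini-batch (default 500M), so a smaller value lowers peak memory. The -x ava-ont preset, the 100M value, and the file names reads.fq and overlaps.paf.gz are assumptions for illustration; the thread does not state the exact command or batch size that was used.

```shell
# Illustrative values only; none of these are taken from the thread.
reads=reads.fq
batch=100M      # minimap2 -K: bases loaded per mini-batch (default 500M)
threads=12

# All-vs-all overlap command with a smaller mini-batch (assumed preset: ava-ont).
cmd="minimap2 -x ava-ont -K $batch -t $threads $reads $reads"
echo "$cmd"

# Run it only if minimap2 is installed and the input file exists.
if command -v minimap2 >/dev/null 2>&1 && [ -f "$reads" ]; then
    $cmd | gzip -1 > overlaps.paf.gz
fi
```

With the change described above, the same kind of value would be given to mini_assemble through its new -K option instead.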

avitalsteiman commented 5 years ago

Amazing! Thank you!

avitalsteiman commented 5 years ago

Hi, I appreciate your help. It seems I closed the issue too soon... I hope you don't mind this rookie question. I updated my files with the "git pull origin master" command, and I can see the changes in the mini_assemble script. I even reactivated the pomoxis environment, but when I run mini_assemble with -K I get an error saying it is an invalid option. I tried different forms of the option (-K5, -K5M, -K 5, etc.) but received the same error. Here is an example:

(pomoxis) (base) biomesh@biomesh:~/avital$ mini_assemble -i test3.fq -o assembled -p test3_assmbl -K 5 -t 12 -c
Invalid option: -K

Thanks, Avital

On Wed, Feb 20, 2019 at 5:33 PM cjw85 notifications@github.com wrote:

Hi @avitalsteiman https://github.com/avitalsteiman

I have pushed a change https://github.com/nanoporetech/pomoxis/commit/509bc6f2d30d52bf5574bd351dbfa93d849860de that adds a -K option to the mini_assemble program. This option is passed to all calls of minimap2.


cjw85 commented 5 years ago

Hi,

As well as the commands you have run, you will also need to run

python setup.py install

from the pomoxis directory to have the updated program available for use.
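One way to see why git pull alone was not enough is to compare the copy of the script that the shell actually runs with the copy in the git checkout. The sketch below assumes the installer places a plain copy of the script on PATH and that the script lives at scripts/mini_assemble inside the repository; both are assumptions, and is_stale is a hypothetical helper.

```shell
# Succeeds (exit status 0) when the installed copy differs from the checkout copy.
is_stale() {
    ! cmp -s "$1" "$2"
}

installed=$(command -v mini_assemble || true)   # the copy the shell will run
checkout=scripts/mini_assemble                  # assumed location in the pomoxis repo

if [ -n "$installed" ] && [ -f "$checkout" ] && is_stale "$installed" "$checkout"; then
    echo "installed mini_assemble is out of date; run: python setup.py install"
fi
```

Run from the pomoxis directory with the environment active, this prints the reminder only when the two files differ.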

avitalsteiman commented 5 years ago

Thanks

On Thu, Feb 21, 2019 at 10:32 AM cjw85 notifications@github.com wrote:

Hi,

As well as the commands you have run, you will also need to run

python setup.py install

from the pomoxis directory to have the updated program available for use.
