ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

-g option (small genome) #183

Closed tay45 closed 4 years ago

tay45 commented 4 years ago

Hello,

I am trying to assemble a viral genome (~190 kb), but wtdbg2 produced 0 contigs when I set '-g 190k'. It also printed a questionable warning ("WARNNING: input file is not in gzip format") even though my input was a *.fasta.gz file.

The result stayed the same until I increased -g to 500k, but with that value the contig was too short.

When I arbitrarily set a much larger genome size (-g 4.6m), it produced a reasonable contig.

Do you have any advice on how to use the '-g' option? My commands and the log are below.

Thank you!


wtdbg2 -x rs -g 190k -t 16 -i ../subreads.fasta.gz -fo hov3_p1

wtpoa-cns -t 16 -i hov3_p1.ctg.lay.gz -fo hov3_p1.ctg.fa

WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan ruanjue@gmail.com
Version: 2.5 (20190621)
Usage: wtdbg2 [options] -i -o [reads.fa ...]
Options:
 -i Long reads sequences file (REQUIRED; can be multiple), []
 -o Prefix of output files (REQUIRED), []
 -t Number of threads, 0 for all cores, [4]
 -f Force to overwrite output files
 -x Presets, comma delimited, []
    preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
    preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
    preset3: -p 19 -AS 2 -s 0.05 -L 5000
    sequel/sq nanopore/ont:
    (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
    (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
    preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
 -g Approximate genome size (k/m/g suffix allowed) [0]
 -X Choose the best depth from input reads(effective with -g) [50.0]
 -L Choose the longest subread and drop reads shorter than (5000 recommended for PacBio) [0]
    Negative integer indicate tidying read names too, e.g. -5000.
 -k Kmer fsize, 0 <= k <= 23, [0]
 -p Kmer psize, 0 <= p <= 23, [21]
    k + p <= 25, seed is +
 -K Filter high frequency kmers, maybe repetitive, [1000.05]
    = 1000 and indexing >= (1 - 0.05) * total_kmers_count
 -S Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]
    -S is very useful in saving memeory and speeding up
    please note that subsampling kmers will have less matched length
 -l Min length of alignment, [2048]
 -m Min matched length by kmer matching, [200]
 -R Enable realignment mode
 -A Keep contained reads during alignment
 -s Min similarity, calculated by kmer matched length / aligned length, [0.05]
 -e Min read depth of a valid edge, [3]
 -q Quiet
 -v Verbose (can be multiple)
 -V Print version information and then exit
 --help Show more options

-- total memory 263847036.0 kB
-- available 261383516.0 kB
-- 28 cores
-- Starting program: wtdbg2 -x rs -g 190k -t 16 -i /net/isi-dcnl/ifs/user_data/Seq/pacbio_analysis/thkang/2019/Shyam/canu/P1/subreads.fasta.gz -o hov3_p1
-- pid 11291
-- date Mon Mar 30 13:56:01 2020

[Mon Mar 30 13:56:01 2020] loading reads
0 10000 20000 30000 40000 50000 60000 70000 77246 reads
[Mon Mar 30 13:56:06 2020] filtering from 77246 reads (>=5000 bp), 876146697 bp. Try selecting 9500000 bp
[Mon Mar 30 13:56:06 2020] Done, 343 reads (>=5000 bp), 9523712 bp, 37101 bins
PROC_STAT(0) : real 5.211 sec, user 9.580 sec, sys 0.920 sec, maxrss 434764.0 kB, maxvsize 690776.0 kB
[Mon Mar 30 13:56:06 2020] Set --edge-cov to 3
KEY PARAMETERS: -k 0 -p 21 -K 1000.049988 -S 4.000000 -s 0.050000 -g 190000 -X 50.000000 -e 3 -L 5000
[Mon Mar 30 13:56:06 2020] generating nodes, 16 threads
[Mon Mar 30 13:56:06 2020] indexing bins[(0,37101)/37101] (9497856/866301696 bp), 16 threads
[Mon Mar 30 13:56:06 2020] - scanning kmers (K0P21S4.00) from 37101 bins
0 37101 bins
** Kmer Frequency **
** 1 - 201 **
Quatiles:
10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
1 1 1 1 1 1 1 1 1 1
PROC_STAT(0) : real 5.512 sec, user 10.560 sec, sys 1.150 sec, maxrss 482640.0 kB, maxvsize 1816224.0 kB
[Mon Mar 30 13:56:06 2020] - high frequency kmer depth is set to 1000
[Mon Mar 30 13:56:06 2020] - Total kmers = 1563366
[Mon Mar 30 13:56:06 2020] - average kmer depth = 2
[Mon Mar 30 13:56:06 2020] - 1560266 low frequency kmers (<2)
[Mon Mar 30 13:56:06 2020] - 0 high frequency kmers (>1000)
[Mon Mar 30 13:56:06 2020] - indexing 3100 kmers, 6614 instances (at most)
0 37101 bins
[Mon Mar 30 13:56:06 2020] - indexed 3100 kmers, 6568 instances
[Mon Mar 30 13:56:06 2020] - masked 35981 bins as closed
[Mon Mar 30 13:56:06 2020] - sorting
PROC_STAT(0) : real 5.512 sec, user 10.560 sec, sys 1.150 sec, maxrss 482640.0 kB, maxvsize 1816224.0 kB
[Mon Mar 30 13:56:06 2020] Done
0|0 342 reads|total hits 0
PROC_STAT(0) : real 5.712 sec, user 11.980 sec, sys 1.230 sec, maxrss 484752.0 kB, maxvsize 1816224.0 kB
[Mon Mar 30 13:56:06 2020] sorting rdhits ... Done
[Mon Mar 30 13:56:06 2020] clipping ... 100.00% bases
[Mon Mar 30 13:56:06 2020] generating regs ... 0
[Mon Mar 30 13:56:06 2020] sorting regs ... Done
[Mon Mar 30 13:56:06 2020] generating intervals ... 0 intervals
[Mon Mar 30 13:56:06 2020] selecting important intervals from 0 intervals
[Mon Mar 30 13:56:06 2020] Intervals: kept 0, discarded 0
PROC_STAT(0) : real 5.712 sec, user 11.980 sec, sys 1.230 sec, maxrss 484752.0 kB, maxvsize 1816224.0 kB
[Mon Mar 30 13:56:06 2020] Done, 0 nodes
[Mon Mar 30 13:56:06 2020] output "hov3_p1.1.nodes". Done.
[Mon Mar 30 13:56:06 2020] median node depth = 0
[Mon Mar 30 13:56:06 2020] masked 0 high coverage nodes (>200 or <3)
[Mon Mar 30 13:56:06 2020] masked 0 repeat-like nodes by local subgraph analysis
[Mon Mar 30 13:56:06 2020] generating edges
[Mon Mar 30 13:56:06 2020] Done, 1 edges
[Mon Mar 30 13:56:06 2020] output "hov3_p1.1.reads". Done.
[Mon Mar 30 13:56:06 2020] output "hov3_p1.1.dot.gz". Done.
[Mon Mar 30 13:56:06 2020] graph clean
[Mon Mar 30 13:56:06 2020] rescued 0 low cov edges
[Mon Mar 30 13:56:06 2020] deleted 0 binary edges
[Mon Mar 30 13:56:06 2020] deleted 0 isolated nodes
[Mon Mar 30 13:56:06 2020] cut 0 transitive edges
[Mon Mar 30 13:56:06 2020] output "hov3_p1.2.dot.gz". Done.
[Mon Mar 30 13:56:06 2020] deleted 0 isolated nodes
[Mon Mar 30 13:56:06 2020] output "hov3_p1.3.dot.gz". Done.
[Mon Mar 30 13:56:06 2020] cut 0 branching nodes
[Mon Mar 30 13:56:06 2020] deleted 0 isolated nodes
[Mon Mar 30 13:56:06 2020] building unitigs
[Mon Mar 30 13:56:06 2020]
[Mon Mar 30 13:56:06 2020] output "hov3_p1.frg.nodes". Done.
[Mon Mar 30 13:56:06 2020] generating links
[Mon Mar 30 13:56:06 2020] generated 1 links
[Mon Mar 30 13:56:06 2020] output "hov3_p1.frg.dot.gz". Done.
[Mon Mar 30 13:56:07 2020] rescue 0 weak links
[Mon Mar 30 13:56:07 2020] deleted 0 binary links
[Mon Mar 30 13:56:07 2020] cut 0 transitive links
[Mon Mar 30 13:56:07 2020] remove 0 boomerangs
[Mon Mar 30 13:56:07 2020] remove 0 weak branches
[Mon Mar 30 13:56:07 2020] cut 0 tips
[Mon Mar 30 13:56:07 2020] pop 0 bubbles
[Mon Mar 30 13:56:07 2020] detached 0 repeat-associated paths
[Mon Mar 30 13:56:07 2020] cut 0 tips
[Mon Mar 30 13:56:07 2020] output "hov3_p1.ctg.dot.gz". Done.
[Mon Mar 30 13:56:07 2020] building contigs
[Mon Mar 30 13:56:07 2020] searched 0 contigs
[Mon Mar 30 13:56:07 2020] Estimated:
[Mon Mar 30 13:56:07 2020] output 0 contigs
[Mon Mar 30 13:56:07 2020] Program Done
PROC_STAT(TOTAL) : real 6.013 sec, user 12.030 sec, sys 1.340 sec, maxrss 502244.0 kB, maxvsize 1816224.0 kB

--
-- total memory 263847036.0 kB
-- available 261383312.0 kB
-- 28 cores
-- Starting program: wtpoa-cns -t 16 -i hov3_p1.ctg.lay.gz -fo hov3_p1.ctg.fa
-- pid 11525
-- date Mon Mar 30 13:56:07 2020

WARNNING: input file is not in gzip format

0 contigs 0 edges 0 bases
PROC_STAT(TOTAL) : real 0.103 sec, user 0.000 sec, sys 0.020 sec, maxrss 9864.0 kB, maxvsize 1178476.0 kB
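
The read-selection numbers in the log above can be checked with a little arithmetic. This is only a rough sketch: it assumes the selection budget is simply the -g value multiplied by the -X depth, which is consistent with the "Try selecting 9500000 bp" line, and `parse_size` and `selection_budget` are hypothetical helper names made up for this example, not part of wtdbg2.

```python
# Values copied from the wtdbg2 log above.
TOTAL_READS = 77246     # reads >= 5000 bp
TOTAL_BP = 876146697    # bases in those reads
KEPT_READS = 343        # reads kept after selection with -g 190k
DEPTH_X = 50.0          # default -X


def parse_size(text):
    """Parse a size with an optional k/m/g suffix (hypothetical helper)."""
    units = {"k": 1e3, "m": 1e6, "g": 1e9}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(round(float(text[:-1]) * units[text[-1]]))
    return int(float(text))


def selection_budget(genome_size, depth=DEPTH_X):
    """Bases to select, ASSUMING budget = genome size * depth."""
    return int(genome_size * depth)


# The three -g values tried in this issue.
for g in ("190k", "500k", "4.6m"):
    budget = selection_budget(parse_size(g))
    share = min(1.0, budget / TOTAL_BP)
    print(f"-g {g}: budget {budget} bp, {share:.1%} of input bases")

# Nominal depth if every input base came from a 190 kb genome.
nominal_depth = TOTAL_BP / parse_size("190k")
print(f"nominal depth at 190 kb: {nominal_depth:.0f}x")
print(f"reads kept with -g 190k: {KEPT_READS / TOTAL_READS:.2%}")
```

With -g 190k only a small slice of the input is used, so if most of those reads do not come from the virus there is little left to assemble. A larger -g simply widens the slice.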

ruanjue commented 4 years ago

The data may be contaminated with host genome reads, so when wtdbg2 selects 50X of reads you end up with only a few virus reads. -g is used to select reads up to the -X depth (50 by default) and to estimate the edge coverage cutoff. Leave it out when the genome size is very small. Raise -e from its default of 3 to a larger value when sequence coverage is very high.
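
To illustrate the point about host contamination, here is a toy simulation. It is a sketch under stated assumptions, not wtdbg2's actual code: the selection rule (keep the longest reads until the base budget is reached) is assumed, the read lengths and the 5% virus fraction are invented, and `select_reads` is a hypothetical helper.

```python
import random


def select_reads(reads, budget_bp):
    """Keep the longest reads until the base budget is reached (ASSUMED rule)."""
    kept, total = [], 0
    for read in sorted(reads, key=lambda r: r[1], reverse=True):
        if total >= budget_bp:
            break
        kept.append(read)
        total += read[1]
    return kept


# Toy data, invented for illustration: about 5% of reads come from the virus.
rng = random.Random(0)
reads = [
    ("virus" if rng.random() < 0.05 else "host", rng.randint(5000, 30000))
    for _ in range(20000)
]

genome_size, depth = 190_000, 50          # mirrors -g 190k with the default -X 50
kept = select_reads(reads, genome_size * depth)

kept_bp = sum(length for _, length in kept)
virus_bp = sum(length for origin, length in kept if origin == "virus")
print(f"kept {len(kept)} of {len(reads)} reads, {kept_bp} bp")
print(f"virus depth among kept reads: {virus_bp / genome_size:.1f}x")
```

In this toy setup the selected reads add up to roughly 50X of a 190 kb genome, but only the virus-derived share of them contributes real coverage. That is why the advice is to omit -g for a very small genome, so that no reads are discarded, and to raise -e to suit the resulting coverage. The thread does not give a specific -e value.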

tay45 commented 4 years ago

Hello Jue,

Thank you for your comments!

Taehee


katievigil commented 1 year ago

@ruanjue What wtdbg2 command do you recommend running for viral metagenomic samples with small genomes and low coverage?