ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Parameters for assembling short sequences (generating small assemblies) #244

Closed javiercguard closed 2 years ago

javiercguard commented 2 years ago

Hi, I'm not sure if this is possible, but I'd like to assemble short sequences (extracted from nanopore reads, so they are noisy) into a consensus. The sequences are around 150 bp long: for example, 8 sequences in the 145-175 bp range. I've set -L to 100. It loads the "reads", but it generates no k-mers:

Log:
```
-- Starting program: wtdbg2 -i fasta.fasta -f -o wtest/test -x ont -L 100 -g 166
-- pid 130567
-- date Thu Feb 24 18:34:51 2022
--
[Thu Feb 24 18:34:51 2022] loading reads
15 reads
[Thu Feb 24 18:34:51 2022] Done, 15 reads (>=100 bp), 2502 bp, 0 bins
** PROC_STAT(0) **: real 0.009 sec, user 0.000 sec, sys 0.000 sec, maxrss 1040.0 kB, maxvsize 86220.0 kB
[Thu Feb 24 18:34:51 2022] Set --edge-cov to 2
KEY PARAMETERS: -k 15 -p 0 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 166 -X 50.000000 -e 2 -L 100
[Thu Feb 24 18:34:51 2022] generating nodes, 4 threads
[Thu Feb 24 18:34:51 2022] indexing bins[(0,0)/0] (0/0 bp), 4 threads
[Thu Feb 24 18:34:51 2022] - scanning kmers (K15P0S2.00) from 0 bins
0 bins
********************** Kmer Frequency **********************
********************** 1 - 201 **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
     0     0     0     0     0     0     0     0     0     0
** PROC_STAT(0) **: real 0.009 sec, user 0.000 sec, sys 0.000 sec, maxrss 1040.0 kB, maxvsize 86220.0 kB
[Thu Feb 24 18:34:51 2022] - high frequency kmer depth is set to 65535
[Thu Feb 24 18:34:51 2022] - Total kmers = 0
[Thu Feb 24 18:34:51 2022] - average kmer depth = 0
[Thu Feb 24 18:34:51 2022] - 0 low frequency kmers (<2)
[Thu Feb 24 18:34:51 2022] - 0 high frequency kmers (>65535)
[Thu Feb 24 18:34:51 2022] - indexing 0 kmers, 0 instances (at most)
0 bins
[Thu Feb 24 18:34:51 2022] - indexed 0 kmers, 0 instances
[Thu Feb 24 18:34:51 2022] - masked 0 bins as closed
[Thu Feb 24 18:34:51 2022] - sorting
** PROC_STAT(0) **: real 0.009 sec, user 0.000 sec, sys 0.000 sec, maxrss 1040.0 kB, maxvsize 86220.0 kB
[Thu Feb 24 18:34:51 2022] Done 0 reads|total hits 0
** PROC_STAT(0) **: real 0.009 sec, user 0.000 sec, sys 0.000 sec, maxrss 1040.0 kB, maxvsize 86220.0 kB
[Thu Feb 24 18:34:51 2022] sorting rdhits ... Done
[Thu Feb 24 18:34:51 2022] clipping ... -nan% bases
[Thu Feb 24 18:34:51 2022] generating regs ... 0
[Thu Feb 24 18:34:51 2022] sorting regs ... Done
[Thu Feb 24 18:34:51 2022] generating intervals ... 0 intervals
[Thu Feb 24 18:34:51 2022] selecting important intervals from 0 intervals
[Thu Feb 24 18:34:51 2022] Intervals: kept 0, discarded 0
** PROC_STAT(0) **: real 0.009 sec, user 0.000 sec, sys 0.000 sec, maxrss 1040.0 kB, maxvsize 86220.0 kB
[Thu Feb 24 18:34:51 2022] Done, 0 nodes
[Thu Feb 24 18:34:51 2022] output "wtest/test.1.nodes". Done.
[Thu Feb 24 18:34:51 2022] median node depth = 0
[Thu Feb 24 18:34:51 2022] masked 0 high coverage nodes (>200 or <2)
[Thu Feb 24 18:34:51 2022] masked 0 repeat-like nodes by local subgraph analysis
[Thu Feb 24 18:34:51 2022] generating edges
[Thu Feb 24 18:34:51 2022] Done, 1 edges
[Thu Feb 24 18:34:51 2022] output "wtest/test.1.reads". Done.
[Thu Feb 24 18:34:51 2022] output "wtest/test.1.dot.gz". Done.
[Thu Feb 24 18:34:51 2022] graph clean
[Thu Feb 24 18:34:51 2022] rescued 0 low cov edges
[Thu Feb 24 18:34:51 2022] deleted 0 binary edges
[Thu Feb 24 18:34:51 2022] deleted 0 isolated nodes
[Thu Feb 24 18:34:51 2022] cut 0 transitive edges
[Thu Feb 24 18:34:51 2022] output "wtest/test.2.dot.gz". Done.
[Thu Feb 24 18:34:51 2022] deleted 0 isolated nodes
[Thu Feb 24 18:34:51 2022] output "wtest/test.3.dot.gz". Done.
[Thu Feb 24 18:34:51 2022] cut 0 branching nodes
[Thu Feb 24 18:34:51 2022] deleted 0 isolated nodes
[Thu Feb 24 18:34:51 2022] building unitigs
[Thu Feb 24 18:34:51 2022]
[Thu Feb 24 18:34:51 2022] output "wtest/test.frg.nodes". Done.
[Thu Feb 24 18:34:51 2022] generating links
[Thu Feb 24 18:34:51 2022] generated 1 links
[Thu Feb 24 18:34:51 2022] output "wtest/test.frg.dot.gz". Done.
[Thu Feb 24 18:34:51 2022] rescue 0 weak links
[Thu Feb 24 18:34:51 2022] deleted 0 binary links
[Thu Feb 24 18:34:51 2022] cut 0 transitive links
[Thu Feb 24 18:34:51 2022] remove 0 boomerangs
[Thu Feb 24 18:34:51 2022] remove 0 weak branches
[Thu Feb 24 18:34:51 2022] cut 0 tips
[Thu Feb 24 18:34:51 2022] pop 0 bubbles
[Thu Feb 24 18:34:51 2022] detached 0 repeat-associated paths
[Thu Feb 24 18:34:51 2022] cut 0 tips
[Thu Feb 24 18:34:51 2022] output "wtest/test.ctg.dot.gz". Done.
[Thu Feb 24 18:34:51 2022] building contigs
[Thu Feb 24 18:34:51 2022] searched 0 contigs
[Thu Feb 24 18:34:51 2022] Estimated:
[Thu Feb 24 18:34:51 2022] output 0 contigs
[Thu Feb 24 18:34:51 2022] Program Done
** PROC_STAT(TOTAL) **: real 0.109 sec, user 0.050 sec, sys 0.050 sec, maxrss 43888.0 kB, maxvsize 422720.0 kB
---
```

I've tried low values for -K and -e, and omitting -g, to no avail. Is it possible to generate a small assembly on purpose?

Thanks!
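
A possible reading of the `2502 bp, 0 bins` line in the log above, sketched in Python. It assumes that wtdbg2 splits each read into fixed 256 bp bins and ignores a trailing partial bin; the 256 bp figure comes from the wtdbg2 paper, while the floor-division rule and the names `BIN_SIZE` and `full_bins` are illustrative assumptions, not taken from the wtdbg2 source. Under that assumption, reads of 145-175 bp contribute no bins at all, so there is nothing to index k-mers from.

```python
# Assumption: wtdbg2 works on fixed-size bins of 256 bp (as described in the
# wtdbg2 paper) and a read only contributes its complete bins.
BIN_SIZE = 256


def full_bins(read_length, bin_size=BIN_SIZE):
    """Number of complete bins a read of the given length would contribute."""
    return read_length // bin_size


# Read lengths in the range from the question, plus two longer ones for contrast.
for length in (145, 166, 175, 300, 1000):
    print(f"{length} bp -> {full_bins(length)} bins")
```

If this assumption holds, lowering -L only changes which reads are loaded; it does not make sub-256 bp reads usable for graph construction.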

ruanjue commented 2 years ago

This is a multiple sequence alignment problem; bsalign poa will help you. https://github.com/ruanjue/bsalign

javiercguard commented 2 years ago

I tried bsalign, but the consensus was longer than I wanted, so I decided to use wtdbg2 with longer input sequences. Thanks!
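
For anyone taking the same route, here is a minimal Python sketch of selecting only the longer sequences before passing them to wtdbg2. The helper names `read_fasta` and `keep_long` and the toy records are hypothetical, and the length threshold is up to the user; nothing here is part of wtdbg2 or bsalign.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_long(records, min_len):
    """Keep only records whose sequence has at least min_len bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


# Toy input; real input would be read from the FASTA file given to wtdbg2.
example = ">a\nACGT\n>b\nAC\nGTACGT\n"
for name, seq in keep_long(read_fasta(example), min_len=5):
    print(f">{name}\n{seq}")
```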