ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

more data -> worse assembly?? #149

Closed PeterEmmrich closed 5 years ago

PeterEmmrich commented 5 years ago

Hello! We are trying to put together a draft of a large plant genome from Nanopore data, but weirdly, putting in more coverage doesn't seem to improve the assembly!?

We did the first run with a PromethION dataset amounting to only about 20X, with this read length distribution: 3FC_distribution_stacked_100b.pdf

using these parameters: -x ont -g 8.2g -t 16 -S 2 --edge-min 2 --rescue-low-cov-edges

and got these initial results, which I was quite happy with for a first try

contigs (>= 0 bp) 163460
contigs (>= 10000 bp) 115478
contigs (>= 50000 bp) 31834
Largest contig 1195731
Total length 6108463310
N50 78873
L50 19080

We then got some more sequencing data, adding another 10X, most of it in 20kb+ reads: whole set read lengths_stacked.pdf

and ran the assembly again using similar parameters (but more threads) -x preset3 -g 8.2g -t 112 --tidy-reads 2000 --edge-min 2 --rescue-low-cov-edges

but we got these results, which don't seem significantly better than the first round - a larger assembly, but in many more contigs and with a smaller largest contig and lower N50.

contigs (>= 0 bp) 208502
contigs (>= 10000 bp) 141093
contigs (>= 50000 bp) 34069
Largest contig 1170373
Total length 6710115277
N50 62988
L50 25731

Could you please advise on how we could improve this? Are there any parameters you would set differently given our dataset?

Many thanks, Peter

ruanjue commented 5 years ago

Maybe you can try '-p 19 -AS 2 -L 5000 -R --aln-dovetail -1 --drop-low-cov-edges'.

PeterEmmrich commented 5 years ago

Thank you for the quick reply, we will try that! Best wishes, Peter

mjpdejong commented 5 years ago

Have you noticed the additional 600 Mb of assembled sequence? This has impact on the N50/L50 values...
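
To illustrate the point about N50/L50, here is a minimal Python sketch. The contig lengths are made-up toy values, not data from this assembly, and `n50_l50` is a hypothetical helper that follows the usual definition (the contig length at which the running sum of sorted lengths reaches half the total). It shows that adding extra short contigs to an otherwise unchanged set of long contigs lowers N50 and raises L50.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # N50 is the length of the contig that takes the running sum to half the total;
        # L50 is how many contigs were needed to get there.
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Toy values only (assumption): five contigs, then the same five plus ten short fragments.
base = [100, 80, 60, 40, 20]
extra = base + [10] * 10

print(n50_l50(base))   # long contigs only
print(n50_l50(extra))  # same long contigs plus extra short sequence
```

The long contigs are identical in both lists; only the total length changes, which is enough to shift both statistics.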

PeterEmmrich commented 5 years ago

Yes, that is certainly an improvement and could account for some (or all?) of the drop in assembly N50, but I was hoping for a better result, given that we were putting in an extra 50% of sequencing data, with a higher average read N50.

Anyway, the assembly is still running with the parameters Jue suggested (after a hiccup at the HPC) and I'll give an update once it's done.



ruanjue commented 5 years ago

The inflation of genome size was mostly caused by fragmentary alignments, which can be discarded with -l 4096, or with an even larger value such as -l 8192.

PeterEmmrich commented 5 years ago

Hey, we've got a new assembly now, which is looking much better! The total assembly size has gone down again, but the largest contig and the N50 have doubled compared to the first assembly and the L50 is down accordingly. It looks like we still have quite a few <10k fragment contigs, but we'll see what we can do from here by merging with other assemblies. Many thanks for your advice!

Assembly lsa3.ctg
contigs (>= 0 bp) 162994
contigs (>= 1000 bp) 162947
contigs (>= 5000 bp) 138511
contigs (>= 10000 bp) 85715
contigs (>= 25000 bp) 43848
contigs (>= 50000 bp) 23711
Total length (>= 0 bp) 6223992364
Total length (>= 1000 bp) 6223958958
Total length (>= 5000 bp) 6120814110
Total length (>= 10000 bp) 5751743953
Total length (>= 25000 bp) 5090391365
Total length (>= 50000 bp) 4383480177
contigs 162985
Largest contig 2768903
Total length 6223989194
GC (%) 38.66
N50 155574
N75 38461
L50 8679
L75 30223
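
For a quick side-by-side of the three runs, the reported numbers can be put into a small Python sketch. All values are copied from the summaries in this thread; the run labels, the `ratio` helper, and the use of the `-g 8.2g` estimate as the denominator are choices made for illustration only.

```python
# Values copied from the assembly summaries reported in this thread.
runs = {
    "first": {"total": 6108463310, "largest": 1195731, "n50": 78873, "l50": 19080},
    "second": {"total": 6710115277, "largest": 1170373, "n50": 62988, "l50": 25731},
    "lsa3.ctg": {"total": 6223989194, "largest": 2768903, "n50": 155574, "l50": 8679},
}

GENOME_SIZE = 8.2e9  # the estimate passed as -g 8.2g, not a measured value


def ratio(run, metric, baseline="first"):
    """Return a run's metric relative to the baseline run."""
    return runs[run][metric] / runs[baseline][metric]


for name, stats in runs.items():
    print(
        f"{name}: {stats['total'] / GENOME_SIZE:.1%} of the -g estimate, "
        f"N50 x{ratio(name, 'n50'):.2f}, largest contig x{ratio(name, 'largest'):.2f}, "
        f"L50 x{ratio(name, 'l50'):.2f} vs first run"
    )

# Extra sequence in the second run relative to the first, in bp.
print(runs["second"]["total"] - runs["first"]["total"])
```

The ratios are all relative to the first run, so the first line of output is the baseline (x1.00 throughout).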

cement-head commented 3 years ago

@PeterEmmrich Hi Peter, can you share the "final" CLI arguments for the wtdbg2 assembler that you were most happy with? Thx