ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

wtdbg2 may produce an assembly smaller than the true genome #172

Closed: Lordhooze closed this issue 4 years ago

Lordhooze commented 4 years ago

Hi, ruanjue: I have noticed that, for nanopore data, wtdbg2 may produce an assembly smaller than the true genome.

How can this problem be solved?

Three months ago, I assembled a genome using wtdbg2 (80X coverage). The contig N50 was very good.

However, after annotation, the BUSCO score is only 90%.

I suspect wtdbg2 may have missed some regions of the genome, which led to the low BUSCO score.

ruanjue commented 4 years ago

Try aligning the core CDS against the assembly to see whether they are missing from the assembly or merely not well polished.
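
A minimal sketch of that check, assuming minimap2 and samtools are installed; core_cds.fa and asm.ctg.fa are placeholder file names:

```sh
# Align core CDS (e.g. BUSCO single-copy genes) to the draft assembly;
# the splice preset tolerates introns when a CDS spans multiple exons.
minimap2 -ax splice -t 16 asm.ctg.fa core_cds.fa > cds_vs_asm.sam

# CDS with no alignment at all (flag 0x4) are candidates for regions
# genuinely missing from the assembly.
samtools view -c -f 4 cds_vs_asm.sam

# CDS that do align but with many mismatches/indels suggest the sequence
# is present yet under-polished, rather than missing.
```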

Lordhooze commented 4 years ago

Ok, many thanks, I will do that.

melop commented 4 years ago

Dear Dr. Ruan,

Is there an explanation or speculation on why the assembly size would be smaller with ONT data? For our genome, Illumina data always produce an assembly much smaller than the genome size estimate (about 2/3 of it); we suspected the problem was collapsed repeats. Thank you! Rongfeng Cui

ruanjue commented 4 years ago

I haven't found the exact reason. It might be: 1) alignments missed (false negatives); 2) collapsed tandem repeats; 3) collapsed long repeats that were not untangled.

PS: This happens not only with ONT data, but it is worse there.

melop commented 4 years ago

Dear Dr. Ruan, thank you for the quick reply. How much sequencing coverage does wtdbg2 need before additional coverage brings little further improvement? Would performance be better if I first corrected the input reads with second-generation data (say, with HALC)?

ruanjue commented 4 years ago

50X is OK; more than 80X should improve things little. wtdbg2 was designed to handle raw long noisy reads. If you first correct the raw reads with Canu or another long-read self-correction tool, the results may be better, but I am not sure. I don't think long reads can be correctly corrected by NGS data. In fact, I also don't think long-read correction (except for HiFi reads) is truly correct, even though the corrected reads look very good.
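
For reference, a typical raw-read wtdbg2 run looks like the sketch below; the preset, genome size, thread count, and file names are placeholders to adapt to your data:

```sh
# Assemble raw nanopore reads directly; -x ont selects the ONT preset,
# -g is the approximate genome size, -t the number of threads.
wtdbg2 -x ont -g 1g -t 16 -i reads.ont.fa.gz -fo asm

# Produce consensus contigs from the layout written by wtdbg2.
wtpoa-cns -t 16 -i asm.ctg.lay.gz -fo asm.ctg.fa
```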

melop commented 4 years ago

Thanks, I will experiment with different approaches.

melop commented 4 years ago

Another question: what about contigs assembled from NGS data, which are generally a few kb long? Do you think they would be useful for correction? I imagine repeats cannot be corrected with this approach, but the unique fraction of the genome could be?

ruanjue commented 4 years ago

Why not use the NGS contigs to correct the TGS contigs, instead of correcting the long reads?
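
One way to try this, as a sketch only (the asm20 preset and the file names are assumptions; it reuses wtdbg2's own wtpoa-cns polishing path, feeding it the NGS contigs in place of reads):

```sh
# Map the NGS contigs onto the long-read draft assembly; asm20 is a
# minimap2 assembly-to-assembly preset tolerating a few percent divergence.
minimap2 -t 16 -ax asm20 asm.ctg.fa ngs_contigs.fa | samtools sort -@ 4 -o ngs_on_tgs.bam

# Re-call the draft's consensus from the aligned NGS contigs; -F0x900
# drops secondary/supplementary alignments, -d names the draft to polish.
samtools view -F 0x900 ngs_on_tgs.bam | wtpoa-cns -t 16 -d asm.ctg.fa -i - -fo asm.ngs_polished.fa
```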

melop commented 4 years ago

Good idea, I will try that. Thanks!

melop commented 4 years ago

Dear Dr. Ruan, I noticed that there's this option in wtdbg2:

--err-free-nodes Select nodes from error-free-sequences only. E.g. you have contigs assembled from NGS-WGS reads, and long noisy reads. You can type '--err-free-seq your_ctg.fa --input your_long_reads.fa --err-free-nodes' to perform assembly somehow act as long-reads scaffolding

Is this designed for performing scaffolding using long reads if I input NGS contigs? Will it also fill in gaps if they are fillable by TGS reads? This seems to be a very nice option.

ruanjue commented 4 years ago

--err-free-nodes is meant to be combined with --err-free-seq. However, I haven't supported them for a long time; I forgot to remove --err-free-nodes from the help text. When I have more time, I will restore this function.

melop commented 4 years ago

Ah, I see. I just tried these parameters, but the latest version of wtdbg2 no longer seems to recognize them (it just prints the help information).

Rongfeng