mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs
Other
760 stars 165 forks source link

misassembly in a ONT assembly #397

Closed estolle closed 2 years ago

estolle commented 3 years ago

Hi there,

I've been trying to generate a good assembly of a bee (flye v2.8.1 in fresh conda env, Ubuntu 20.04), ~250Mb genome size , containing a fair amount (~18%) of LTR etc retrotransposons. I have promethion ONT reads (v9.4) and >150x coverage. I rebasecalled last week with the latest guppy (v4), chopped off adapters & low quality (prowler), and further filtered for >=10kb and >= Q13 (i.e. p error: 0.05) using filtlong. this gives ~80x coverage (flye later says something like 77x coverage if I remembr correctly).

I made a few dozen assemblies with >=10kb / 20 / 30 / 40 kb and with various min overlap, mostly 7500 and 10000 bp.

Yet I still have a problem that there is a misassembly introduced in many of these quite nice assemblies. I know its a misassembly because we have an ok'ish chromosome-anchored (genetic linkage map) old assembly generated years ago from short read data (chromosome structure is conserved among related species based on several new highQ assemblies).

The misassembly I am getting seems to be a fusion of one chromosome (16) to another (sometimes that is chr3, 4, or 5). This happens even when using >=40kb reads (Q13! i.e. average error rate 5%) and 10kb min overlap (coverage here was 18x). I now dug deeper and can see its introduced in the final contigs, but not yet present in the first and second steps (draft_assembly.fasta and consensus.fasta). It seems to fuse the two consensus.fasta disjointigs based on an overlapping region of 18kb which appears to have a certain similarity even though each disjointig extends 50-70kb further without similarity.

I attached few IGV images. misassembly1 misassembly2 misassembly3

Is there anything I can change in terms of parameters within flye (even somewhere in the code) to avoid this wrong fusion and clipping of the disjointig ends?

I thought of removing reads containing telomeric sequences (TTAGG in this species), but its only 49 reads with such repeat arrays of >=200bp / 80% ID. Or remove the reads overlapping this problematic region ... but then the sequence extending beyond is lost.

thanks

mikolmogorov commented 3 years ago

Hi,

Thanks for the detailed report.

  1. Are these fusions go through the telomeres?
  2. I think I kinda see a few reads in the first plot that traverse the repetitive region from left to right - but it is hard to see. Is that the case? If so, this is probably the reason Flye is making this connection. It either could be a real connection, or a bunch of chimeric reads (which does happen sometimes). It would be easier to see the plots if you hide indels shorter than 5 bp in IGV settings.
  3. Could you please add 20-repeat/graph_before_rr.fasta file to the IGV plot? Mikhail
estolle commented 3 years ago

Hi,

thanks for your comments. I am not sure if there is a telomeric sequence here, but it looks repetitive, perhaps one of the satellite/low complexity regions. But doesn't look as it contains the TTAGG telomeric repeat.

I also do not really see reads spanning the entire region, in the IGV plot, what looks like single reads spanning it, are two or more, the ends are v close due to clipping.

I attached the graph_before_rr.fasta after the draft and consensus fasta, just below that (before the raw reads) is the coverage of the repeat_graph_edges.fasta

misassembly4 misassembly5 misassembly6 misassembly7

estolle commented 3 years ago

I found another misassembly (fusion between 3 chromosomes) in a scaffold, where the fusion is already present in the disjointig/contig. Also here I cannot identify a telomeric repeat sequence. scf47

mikolmogorov commented 3 years ago

@estolle Thanks! It looks like there is a complex repeat structure in this region. To me it looks like a mosaic repeat with sections that are very similar and other sections that are divergent. Flye seems to be not recognizing a divergent repeat copy and making a false connection. I have a few following suggestions:

  1. Could you try the latest release (2.8.3) and also the latest github code (that you'd need to build). Both contain improved and slightly different approaches to consensus calling, compared to 2.8.1, which should improve graph representation of the repetitive region. Please send the full flye.log files from both runs.
  2. If both runs still contain the fusions, we can try some parameter adjustments.

Mikhail

estolle commented 3 years ago

Dear Mikhail

I ran 2 more assemblies with the 2.8.3 conda release and the current github version. both contain one of the large misassemblies (each a different one) and the github -flye assembly looks a bit more noisy just from a point of synteny to a reference.

I attached both log files, which hopefully give some insights. github-flye.log flye.log

both times the command was: flye --nano-raw $INPUT --out-dir $OUTDIR --threads $CPUs --iterations 0 --min-overlap 10000

Input are ONT 9.4.1/450bps reads, re-basecalled with the very recently released new Guppy and the superaccurate model, adaptertrimmed/Q-filtered with prowler and filtlong, 10kb or larger retained, Q95 retained, N's in homopolymer stretches removed , >75x coverage going into the analysis

these are the basic stats for both assemblies: flye 2.8.3-b1695 (conda) Total length: 256823929 Fragments: 490 Fragments N50: 10327525 Largest frg: 16695365 Scaffolds: 7 Mean coverage: 76

flye 2.8.3-b1725 (github) Total length: 256974731 Fragments: 520 Fragments N50: 8715856 Largest frg: 22939892 Scaffolds: 0 Mean coverage: 75

mikolmogorov commented 3 years ago

Thanks! Ok then, let's see if the newest Guppy data can benefit from alignment sensitivity adjustments. Please try running the latest release with the same parameters and this extra command line: --extra-params assemble_ovlp_divergence=0.05,repeat_graph_ovlp_divergence=0.05,read_align_ovlp_divergence=0.05

mikolmogorov commented 3 years ago

@estolle any updates on that?

estolle commented 3 years ago

Sry for the delay in posting the results.

Using the extra-params I still get the same misassemblies unfortunately.

In the meantime I used Canu-corrected reads to try out and these did not produce the misassemblies.

I do however not know if the corrected reads were trimmed and may not have the sequences anymore causing the mis-assembly.

I continued the process with these corrected reads but for future projects there might be similar problems. Any other suggestions to try out?

Thanks alot

mikolmogorov commented 3 years ago

Might makes sense to try the new release (2.9) - it improves on repeat detection in harder cases, which might help in your situation. And there is a new mode for Guppy 5 data as well.

estolle commented 2 years ago

I ran the same data through flye 2.9 and indeed the result looks much better in terms of misassemblies (i do not see any major assemblies anymore). The resulting assembly however is muchlarger. 349 Mb rather than ~260Mb before (with raw or Canu-corrected reads) here are the stats: ID total_length number mean_length longest short N Gaps N50 N50n N70 N70n N90 N90n flye.assembly.fasta 349437402 1387 251937.56 15088732 502 0 0 5669873 16 1773630 39 161137 211

for comparison: Canu.contigs.fasta 328494814 783 419533.61 15004955 8538 0 0 8124212 14 1810846 33 105601 316

flye.assembly.from.CanuCorrectedReads 263408931 692 380648.74 14949467 26 90 9 8756033 11 4101835 19 1146327 43

This was my command and some parts of the output: flye --nano-hq $INPUT --out-dir $OUTDIR --genome-size 320m --threads 80 --iterations 0 --plasmids --min-overlap 10000

[2021-09-13 16:40:39] INFO: Total read length: 24209215391 [2021-09-13 16:40:39] INFO: Input genome size: 320000000 [2021-09-13 16:40:39] INFO: Estimated coverage: 75 [2021-09-13 16:40:39] INFO: Reads N50/N90: 26133 / 12526 [2021-09-13 16:40:39] INFO: Selected minimum overlap: 10000 [2021-09-13 16:50:01] INFO: Extending reads [2021-09-13 16:54:18] INFO: Overlap-based coverage: 69 [2021-09-13 16:54:18] INFO: Median overlap divergence: 0.0477127 [2021-09-13 20:36:08] INFO: Computing consensus [2021-09-13 20:42:44] INFO: Alignment error rate: 0.062840 [2021-09-13 20:42:49] INFO: Building repeat graph [2021-09-13 20:47:48] INFO: Median overlap divergence: 0.0777643 [2021-09-13 20:52:05] INFO: Aligning reads to the graph [2021-09-13 21:02:18] INFO: Aligned read sequence: 22569934857 / 24208105391 (0.93233) [2021-09-13 21:02:18] INFO: Median overlap divergence: 0.0243967 [2021-09-13 21:02:20] INFO: Mean edge coverage: 70

Would you think this behavior is normal or is this perhaps due to 3-5% error rate expectation for the high quality Nanopore reads? Above you can see overlap divergence in 7.7% for repeats and 4.8% for the rest. Using previous versions of flye (with --nano-raw) these were more like 9-10%.

The input data are Prometheon data (R9.4.1 flowcells) basecalled with Guppy5 in SUP mode and subsequently filtered for Q95 reads and >=10kb.

Thanks!

mikolmogorov commented 2 years ago

@estolle glad to hear that the new version helped with misassemblies!

Yes, for highly repetive/heterozygous genome I would expect the assmebly to increase. This is a sign that more repetitive sequence was resolved - because higher quality reads allow to set more strict alignment thresholds (as you said).

estolle commented 2 years ago

i forgot to mention that the individual sequenced was a haploid bee male and from kmer based analyses (genomescope) we estimated the genomesize to be around 320 Mb. So I need to dig into the data a bit to find out if the flye assembly has some redundancies

mikolmogorov commented 2 years ago

@estolle might be due to repeats - if they have many identical copies genomescope might underestimate the size (?)