malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License
473 stars 49 forks source link

Scaffolding inserts gaps that aren't covered by HiFi or ONT reads and aren't in reference #170

Closed nickgladman closed 11 months ago

nickgladman commented 1 year ago

Hello. Thanks for developing and maintaining this great software.

I scaffolded a genome that was de novo assembled with HiFi and ONT reads (>30x coverage for each read type). The primary assembly had strong BUSCO, Quast, and Merqury metrics (BUSCO > 98%, QV ~ 32, contigs ~ 600, N50 (auN) ~ 50 Mb.

The reference assembly is a high quality assembly that used ONT, HiFi, Illumina and HiC for it's assembly.

The ragtag.py scaffold parameters were -t 8 -r -g 1 -m 100000000. This is what other groups had used for the same species.

When I compare the reference to the scaffolded assembly, there is a large gap on the initial portion of chromosome 3 (~15 Mb) that is not covered almost at all by the HiFi or ONT reads, yet the AGP file says there is evidence of linkage across the gap from the flanking segments. A BLAST of the 5' flanking segment yields hits to other chromosomes, but not to chromosome 3. I'm not sure what could be causing this, but my thinking is I should just manually remove this whole arm of the chromosome since it seems to be an artifact. But would you suggest using ragtag correct first to see if this resolves the issue?

There is the same issue that occurs for a smaller gap on another chromosome (~4 Mb). Also, without the gaps on on these two respective chromosomes, then their length corresponds better to the reference assembly chromosome lengths (as well as other high quality reference assemblies in the same species).

Dotplots confirm strong collinearity between the reference and scaffolded genomes with the exception of these two large gap regions.

Here is the example of the samtools coverage output at chromosome 3 showing the lack of coverage on the gapped region and minimal coverage for the region that is not aligning to chromosome 3

Screen Shot 2023-11-01 at 1 55 19 PM
nickgladman commented 1 year ago

Also, I should note that both of these chromosome gaps are not present when I filter the draft assembly for contigs that are over 100 kb in size prior to scaffolding.

malonge commented 1 year ago

Hi Nick!

Thanks for reaching out. Yeah, it could be an artifact caused by a spurious alignment to the reference. You can manually correct it or I suggest trying RagTag with nucmer instead of Minimap2. I don't think correct would help, in this case.

You can also look at the filtered and processed alignments by adding --debug. You may be able to see the offending alignment in the merged paf file. Let me know how it goes!

nickgladman commented 1 year ago

Hello. Thank you for the advice in using nucmer (including a --debug run). It seemed to work based on the dotplot files that were generated from the .delta files, but these dotplots show all the pre-chromosome assigned contigs. However, when I extract the gap information from the fasta file, it seems like at least one of the large gaps are still present.

I'm not certain why the .delta is showing discrepancies and I am not familiar with interpreting the debug files. Is there manual for understanding the debug outputs or would you be willing to take a look? Thanks again.

Here is the dotplot generated from the nucmer .delta output:

Screen Shot 2023-11-09 at 2 35 10 PM
malonge commented 1 year ago

Thanks for following up. Unfortunately, I don't have docs for the debug files, though I can give a brief description:

  1. "unfiltered" shows raw alignments before filtering
  2. "filtered" shows unique anchor and mapq filtered alignments
  3. "merged" shows alignments after merging filtered alignments that are close to each other.

Ultimately the merged alignments are what's directly used to determine order and. orientation.

Perhaps you could provide a dotplot so I could better understand the nature of these two errors? Is it that they are false joins or they are correct joins but the gap length is wrong?

Thanks, Mike

nickgladman commented 1 year ago

Thank you. I will look into further. Here are dotPlots:

DotPlot using fasta files (aligned with minimap2). This one has the chromosome 3 and 10 issue:

Screen Shot 2023-11-15 at 11 37 47 AM

DotPlot using ragtag scaffold nucmer .detla file:

Screen Shot 2023-11-09 at 2 35 10 PM
nickgladman commented 11 months ago

Hello again. Just to be sure this wasn't an issue with this particular assembly, I've ran the same pipeline on another assembly (with similar quality metrics) and it also yielded the same results: the nucmer-aligned .delta file shows a nice contiguous alignment compared to the reference (with individual contigs and not the chromosome-assigned numbering), but there are large gaps present within the ragtag scaffold output fasta.

Here is the script I'm using for both assemblies:

ragtag.py scaffold \
-o directory_name \
-r \
-g 1 \
-m 100000000 \
--aligner /path/to/conda/env/bin/nucmer \
/path/to/high_quality_ref.fasta \
input.fasta

I've tried different --nucmer-params as well, but they don't seem to fix the problems. I'm still working through interpreting the debug files.

nickgladman commented 11 months ago

Hello once more. I a fit of desperation I tried to increase the nucmer params -l and -c waaay up and that solved the issues with the assembly. Basically, it needed to be bumped up to at least -l 1000 -c 1000 in order to scaffold against the reference without any significant artifacts. And really it seems that -l 1000 -c 10000 might have yielded the best results. So sorry for the confusion, but I just didn't think pumping up nucmer from its defaults by that much would be the solution; perhaps that is why the there was discord in the .delta file and the scaffolded fasta: nucmer was working but I was just throttling the final output parameters. And for people who happen across this solution in the future: this is a plant genome that has a composition of ~70% repetitive elements.