malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License
470 stars 47 forks source link

criteria for breaking contigs #45

Closed dcopetti closed 3 years ago

dcopetti commented 3 years ago

Hello,

Thank you for developing this software - very easy to install and run.

I am using it to place contigs of a ~230 Mb plant assembly (28x HiFi data, assembled with hifiasm, I have mostly p-contigs) using a reference. To make sure I don't have misassemblies I run the correct module with the same reads I had for the assembly. The command was ragtag.py correct -t 45 --inter -j ctg_below_100kb_list -o HalH2_ragtag_corr2 -R readsQ20.fa -T corr reference.fa assembly.fa and the new fasta has 45 more contigs - OK with me.

I went looking at the regions where the cuts are made and I don't see big issues: a good coverage and uniform distribution of reads across the site. There is only one case (first slide) where I agree that the region is questionable (though at a dotplot the region is very simple and the structure is confirmed by collinearity with the reference), but all the other cases seem like any other region to me. What are the criteria that are applied to insert those cuts? Thanks, Dario

RagTag_cuts_HalH2.pdf

malonge commented 3 years ago

Hi there,

RagTag first looks for any non-syntenic regions between the reference and query genomes. For each contig, RagTag will try to merge all of the alignments for that contig into a single larger alignment. But some adjacent alignments may not be merged because they are more than 100 kb apart (-d), because they are opposite strands, or because they align to distinct reference sequences. The spaces between non-merged alignments are candidates for breaking. Without providing reads (-R), ragtag would just break at these points.

If one provides reads, the reads will be aligned to the query sequences. Then, for each candidate breakpoint, ragtag will look within a specified window size around the candidate breakpoint (-v) for regions of high or low coverage. The high/low coverage thresholds can be adjusted with --max-cov and --min-cov, respectively. If there is a position in the query sequence that exceeds these thresholds, that will become the new breakpoint.

The default parameters tend to be very sensitive. In your case, you may consider lowering -d, -v, --min-cov, and rasing --max-cov. Sometimes, I like to set --max-cov to an exceedingly high value so that RagTag will only break at points of low coverage, since high coverage often does not imply a misassembly.

Thanks

p.s. you can run with --debug to get more information about every potential breakpoint

dcopetti commented 3 years ago

Hi,

Thank you for this clear explanation. Then, if the correction step is driven by the WGA, any discrepancy (biological or mi-assembly) in either assembly will drive the fragmentation of the query assembly, only refined by the read alignment.

This is good to know, since my reference is a closely-related species and was assembled from Sanger data (and fosmids - eons ago). At this point, I am actually dropping the correct option since I see very good read support on my assembly (also: no SVs found by cuteSV) and most of the impactful variation may be biological. Thanks a lot! Dario

malonge commented 3 years ago

yes you are right - the read alignment (and parameter turning) is meant to reduce false positives but it ultimately may not be advisable to use correction for certain projects. I make a note of this in the wiki