Closed dcopetti closed 3 years ago
Hi there,
RagTag first looks for any non-syntenic regions between the reference and query genomes. For each contig, RagTag will try to merge all of the alignments for that contig into a single larger alignment. But some adjacent alignments may not be merged because they are more than 100 kb apart (-d
), because they are opposite strands, or because they align to distinct reference sequences. The spaces between non-merged alignments are candidates for breaking. Without providing reads (-R
), ragtag would just break at these points.
If one provides reads, the reads will be aligned to the query sequences. Then, for each candidate breakpoint, ragtag will look within a specified window size around the candidate breakpoint (-v
) for regions of high or low coverage. The high/low coverage thresholds can be adjusted with --max-cov
and --min-cov
, respectively. If there is a position in the query sequence that exceeds these thresholds, that will become the new breakpoint.
The default parameters tend to be very sensitive. In your case, you may consider lowering -d
, -v
, --min-cov
, and rasing --max-cov
. Sometimes, I like to set --max-cov
to an exceedingly high value so that RagTag will only break at points of low coverage, since high coverage often does not imply a misassembly.
Thanks
p.s. you can run with --debug
to get more information about every potential breakpoint
Hi,
Thank you for this clear explanation. Then, if the correction step is driven by the WGA, any discrepancy (biological or mi-assembly) in either assembly will drive the fragmentation of the query assembly, only refined by the read alignment.
This is good to know, since my reference is a closely-related species and was assembled from Sanger data (and fosmids - eons ago). At this point, I am actually dropping the correct
option since I see very good read support on my assembly (also: no SVs found by cuteSV) and most of the impactful variation may be biological.
Thanks a lot!
Dario
Hello,
Thank you for developing this software - very easy to install and run.
I am using it to place contigs of a ~230 Mb plant assembly (28x HiFi data, assembled with hifiasm, I have mostly p-contigs) using a reference. To make sure I don't have misassemblies I run the correct module with the same reads I had for the assembly. The command was
ragtag.py correct -t 45 --inter -j ctg_below_100kb_list -o HalH2_ragtag_corr2 -R readsQ20.fa -T corr reference.fa assembly.fa
and the new fasta has 45 more contigs - OK with me.I went looking at the regions where the cuts are made and I don't see big issues: a good coverage and uniform distribution of reads across the site. There is only one case (first slide) where I agree that the region is questionable (though at a dotplot the region is very simple and the structure is confirmed by collinearity with the reference), but all the other cases seem like any other region to me. What are the criteria that are applied to insert those cuts? Thanks, Dario
RagTag_cuts_HalH2.pdf