malonge / RagTag

Tools for fast and flexible genome assembly scaffolding and improvement
MIT License

Ragtag preservation of SNPs and small Indels? #38

Closed Rob-murphys closed 3 years ago

Rob-murphys commented 3 years ago

I have some symbiotic fungal genomes that are notorious for being hard to assemble due to repeat regions, resulting in quite fragmented assemblies with poor N50/L50 scores.

We have a reference genome for the species and have used the de novo assemblies along with RagTag to generate substantially improved assemblies, so thank you for this!

My question comes when we want to identify biosynthetic gene clusters (BGCs) within our assemblies (which come from different geographical locations). Within this we want to see how conserved or variable common BGCs might be, so the preservation of SNPs and small indels is important to us. This may be an ignorant question, but how will using RagTag (and thus reference-guided assembly) affect our ability to do this? If a contig is too dissimilar, I assume it is discarded?

Best regards Lamma

malonge commented 3 years ago

Hi Lamma,

RagTag just orders and orients input sequences, so the sequence within individual contigs does not change. All of the ordering/orienting info is available in the AGP file.
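As a minimal sketch of how the AGP file can be used, the Python below reads AGP lines and converts a position on an original contig into its position on the scaffold. It assumes the standard tab-separated AGP 2.x column layout; the scaffold and contig names in the sample are made up, and `parse_agp` and `contig_to_scaffold` are hypothetical helpers, not part of RagTag.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    scaffold: str
    scaffold_start: int  # 1-based, inclusive (AGP convention)
    scaffold_end: int
    contig: str
    contig_start: int
    contig_end: int
    orientation: str


def parse_agp(lines):
    """Yield a Placement for every sequence (non-gap) line of an AGP file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if f[4] in ("N", "U"):  # gap lines carry no contig
            continue
        yield Placement(f[0], int(f[1]), int(f[2]), f[5], int(f[6]), int(f[7]), f[8])


def contig_to_scaffold(placements, contig, pos):
    """Map a 1-based contig position to (scaffold, position), or None."""
    for p in placements:
        if p.contig == contig and p.contig_start <= pos <= p.contig_end:
            if p.orientation == "-":
                # Reverse-complemented contig: count back from the contig end.
                return p.scaffold, p.scaffold_start + (p.contig_end - pos)
            return p.scaffold, p.scaffold_start + (pos - p.contig_start)
    return None


# Made-up AGP content: two contigs joined by a 100 bp gap.
EXAMPLE_AGP = """\
## agp-version 2.1
chr1_RagTag\t1\t1000\t1\tW\tcontig_7\t1\t1000\t+
chr1_RagTag\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus
chr1_RagTag\t1101\t1600\t3\tW\tcontig_2\t1\t500\t-
"""

placements = list(parse_agp(EXAMPLE_AGP.splitlines()))
print(contig_to_scaffold(placements, "contig_7", 250))
print(contig_to_scaffold(placements, "contig_2", 1))
```

Because the contig sequence itself is untouched, a variant called on a contig can be carried over to scaffold coordinates this way; only the strand needs care for contigs placed in the "-" orientation.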

While the sequence within contigs does not change, the relative position of contigs obviously does change. Depending on how divergent the reference genome is, there can certainly be errors in this process, especially for less contiguous assemblies. Ultimately it is up to the discretion of the user to decide if reference-guided scaffolding is appropriate for their data. One recommendation is to generate your scaffolds, find your locus of interest, and then use independent data types to try to validate the structure of the locus. For example, you can align long-reads to the region and check for discordant alignments.

I hope this helps!

Rob-murphys commented 3 years ago

Hey Malonge,

Thank you for the reply, this does help :)

"One recommendation is to generate your scaffolds, find your locus of interest, and then use independent data types to try to validate the structure of the locus. For example, you can align long-reads to the region and check for discordant alignments."

So you mean use our assembly pipeline that includes RagTag, then align the initial long reads to this final assembly (with BWA or something) and look for "discordant alignments"? I am not totally sure I know what you mean by "discordant alignments" in this case, as we will of course have reads from the whole FASTA of long reads that won't map to the locus.

Best regards Lamma

malonge commented 3 years ago

Hi Lamma,

I'll start by reiterating that this is mostly to confirm structural accuracy, not so much the base-level accuracy that will impact SNP/INDEL calling.

Your interpretation is correct, though BWA is not well suited for long-read alignment. Also, be sure to align all of your long reads to the entire assembly, not just the locus of interest. Then you can zoom in to your locus of interest and look for discordant alignments.
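To make "discordant alignments" a bit more concrete, here is a rough sketch that scans whole-assembly alignments in PAF format (the tab-separated format that minimap2 writes by default) and reports reads whose alignment stops inside the locus while a sizeable part of the read is left unaligned. The function names, the 500 bp threshold, and the sample records are assumptions for illustration only; this is one possible heuristic, not a definition of discordance.

```python
from collections import defaultdict


def clipped_breakpoints(paf_lines, scaffold, locus_start, locus_end, min_clip=500):
    """Return {read_name: [breakpoints]} for reads whose alignment ends inside
    the locus while at least min_clip bases of the read remain unaligned.

    Coordinates are 0-based and half-open, as in PAF.
    """
    hits = defaultdict(list)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[5] != scaffold:
            continue
        qname, qlen, qstart, qend, strand = f[0], int(f[1]), int(f[2]), int(f[3]), f[4]
        tstart, tend = int(f[7]), int(f[8])
        head_clip, tail_clip = qstart, qlen - qend
        # On the reverse strand the start of the read lies at the target end.
        if strand == "-":
            left_clip, right_clip = tail_clip, head_clip
        else:
            left_clip, right_clip = head_clip, tail_clip
        if left_clip >= min_clip and locus_start <= tstart < locus_end:
            hits[qname].append(tstart)
        if right_clip >= min_clip and locus_start < tend <= locus_end:
            hits[qname].append(tend)
    return dict(hits)


def fake_paf(name, qlen, qstart, qend, strand, tname, tlen, tstart, tend):
    """Build one made-up PAF line (12 mandatory columns) for the demo."""
    span = tend - tstart
    cols = [name, qlen, qstart, qend, strand, tname, tlen, tstart, tend, span, span, 60]
    return "\t".join(map(str, cols))


reads = [
    fake_paf("read_ok", 10000, 0, 10000, "+", "scaf1", 50000, 15000, 25000),
    fake_paf("read_split", 10000, 0, 6000, "+", "scaf1", 50000, 14000, 20000),
    fake_paf("read_rev", 8000, 3000, 8000, "-", "scaf1", 50000, 16000, 21000),
]
flagged = clipped_breakpoints(reads, "scaf1", 18000, 22000)
print(flagged)
```

Many reads breaking at the same position, especially at a junction between two scaffolded contigs, would be a reason to inspect that join more closely; a handful of scattered clipped reads is expected noise.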

The VGP has some nice information about this (what they would call "curation"). Basically, you want to look for continuous alignments and uniform coverage across the locus. Repeats can cause coverage spikes, which isn't necessarily a bad thing if there are long reads comfortably spanning the repeats. Ultimately though, this sort of validation can be nuanced, especially within repeats. The VGP does a nice job of trying to automate some of this stuff. There is also a tool called Asset that tries to automate this.
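The "uniform coverage" part can be sketched in a similar way. The Python below computes mean read depth per window across a locus from PAF records and flags windows that deviate strongly from the median. The window size, the 0.5x/2x cutoffs, and the helper names are arbitrary assumptions, and the sample alignments are invented; this is not how the VGP pipeline or Asset works internally.

```python
from statistics import median


def window_coverage(paf_lines, scaffold, start, end, window=1000):
    """Mean read depth per window across [start, end) on one scaffold."""
    n_windows = (end - start + window - 1) // window
    covered = [0] * n_windows  # aligned bases falling in each window
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12 or f[5] != scaffold:
            continue
        a_start, a_end = max(int(f[7]), start), min(int(f[8]), end)
        if a_start >= a_end:
            continue  # alignment does not overlap the locus
        first = (a_start - start) // window
        last = (a_end - 1 - start) // window
        for w in range(first, last + 1):
            w_start = start + w * window
            w_end = min(w_start + window, end)
            covered[w] += min(a_end, w_end) - max(a_start, w_start)
    depths = []
    for w in range(n_windows):
        w_start = start + w * window
        w_end = min(w_start + window, end)
        depths.append(covered[w] / (w_end - w_start))
    return depths


def flag_windows(depths, low=0.5, high=2.0):
    """Indices of windows with depth below low*median or above high*median."""
    med = median(depths)
    return [i for i, d in enumerate(depths) if d < low * med or d > high * med]


def fake_paf(name, tname, tstart, tend):
    """Made-up full-length '+' alignment as a 12-column PAF line."""
    span = tend - tstart
    cols = [name, span, 0, span, "+", tname, 50000, tstart, tend, span, span, 60]
    return "\t".join(map(str, cols))


reads = [
    fake_paf("r1", "scaf1", 0, 3000),
    fake_paf("r2", "scaf1", 0, 3000),
    fake_paf("r3", "scaf1", 1000, 2000),
    fake_paf("r4", "scaf1", 1000, 2000),
    fake_paf("r5", "scaf1", 1000, 2000),
]
depths = window_coverage(reads, "scaf1", 0, 3000)
print(depths, flag_windows(depths))
```

A flagged high-depth window is only a hint of a collapsed repeat; as noted above, it is not necessarily a problem if long reads span the region end to end.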

Rob-murphys commented 3 years ago

Hi Malonge,

Thank you for all this information. I will have to take some time to understand and attempt the processes you describe! Overall, though, from what you describe, using RagTag should be fine for our purposes of biosynthetic gene cluster mining, as we don't expect there to be any large structural variations. I will certainly still attempt what you describe above!

best regards Lamma

malonge commented 3 years ago

Yes, if a gene cluster is completely contained within a contig, for example, then RagTag will make no difference. And like I said, you can always find the location of that contig in the AGP file.
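A small sketch of that check, assuming the cluster's coordinates on the scaffold are already known (1-based, inclusive, as in AGP): it lists the AGP parts that the interval overlaps and reports whether the interval sits entirely inside one contig. The helper names and the sample AGP content are hypothetical.

```python
def parts_in_interval(agp_lines, scaffold, start, end):
    """AGP parts overlapping [start, end] as (kind, label, part_start, part_end)."""
    parts = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        part_start, part_end = int(f[1]), int(f[2])
        if f[0] != scaffold or part_end < start or part_start > end:
            continue
        kind = "gap" if f[4] in ("N", "U") else "contig"
        # Column 6 holds the contig name, or the gap length on gap lines.
        parts.append((kind, f[5], part_start, part_end))
    return parts


def within_single_contig(agp_lines, scaffold, start, end):
    """True if [start, end] lies entirely inside one contig of the scaffold."""
    parts = parts_in_interval(agp_lines, scaffold, start, end)
    if len(parts) != 1:
        return False
    kind, _, part_start, part_end = parts[0]
    return kind == "contig" and part_start <= start and end <= part_end


# Made-up AGP content: two contigs joined by a 100 bp gap.
EXAMPLE_AGP = """\
## agp-version 2.1
chr1_RagTag\t1\t1000\t1\tW\tcontig_7\t1\t1000\t+
chr1_RagTag\t1001\t1100\t2\tU\t100\tscaffold\tyes\talign_genus
chr1_RagTag\t1101\t1600\t3\tW\tcontig_2\t1\t500\t-
""".splitlines()

print(within_single_contig(EXAMPLE_AGP, "chr1_RagTag", 200, 800))
print(within_single_contig(EXAMPLE_AGP, "chr1_RagTag", 900, 1200))
```

A cluster that fails this check crosses at least one scaffolding join, so its internal structure depends on the reference-guided ordering and is a candidate for the long-read validation discussed earlier in the thread.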

Thanks