Assembly of complex plant regions

jav6745 commented 11 months ago

Hello Paolo, thanks for this great tool.

I produced ONT reads for several target regions (ranging from 40 kb to 1 Mb) of a diploid, homozygous plant and tried to assemble them with Shasta using the R10 Fast config of November 2022. These regions are complex, with many repetitive elements and gene duplications. I got a contiguous assembly, with almost one contig per region, but when I map the reads back to the assembly I get some weird results.

For many of the regions the reads map well and I obtain no SNPs or SVs. However, for some other regions half of the reads map well and half map differently (see image). I doubt these are heterozygous areas (although some could occasionally appear), as this does not occur in other regions of the same genome. I think it is more likely a problem of segmental duplications that were not assembled well. What do you think? Are there parameters I should modify to solve this problem?

Thanks in advance.

paoloshasta commented 11 months ago

In the following I am assuming that the assembly configuration you used is the one for haploid assembly (Nanopore-R10-Fast-Nov2022) and not the one for phased diploid assembly (Nanopore-Phased-R10-Fast-Nov2022). This is the right thing to do if you believe you have a completely homozygous genome, even though it is diploid.

It is very possible that some segmental duplications were not assembled, and that is a very reasonable explanation of what you are seeing. This could be due to the difficulty of resolving repeats, but it could also be a result of the bubble removal process that takes place in the Shasta haploid assembly process: bubbles could have been present before removal not because of heterozygosity but as a result of the repeats.
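One rough way to test the collapsed-duplication hypothesis is to look at read depth: where two copies were merged into one assembled sequence, depth should be roughly double the typical value. The following is only a minimal sketch and is not part of Shasta. The function name `flag_high_depth_windows`, the window size, and the 1.6x threshold are all assumptions chosen for illustration, and the per-base depths are assumed to come from an external tool such as `samtools depth`.

```python
from statistics import median

def flag_high_depth_windows(depths, window=1000, ratio=1.6):
    """Return (start, end, mean_depth) for windows with unusually high depth.

    depths: per-base read depth along one contig (hypothetical input,
            e.g. parsed from the output of a depth tool).
    window: window size in bases (assumed value).
    ratio:  a window is flagged when its mean depth exceeds ratio times
            the median of all window means (assumed threshold).
    """
    if not depths:
        return []
    # Mean depth of each non-overlapping window; the last one may be shorter.
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    typical = median(m for _, _, m in means)
    return [(s, e, m) for s, e, m in means if typical > 0 and m > ratio * typical]

# Toy example: 30x depth everywhere except a 2 kb stretch at about 60x.
toy = [30] * 5000 + [60] * 2000 + [30] * 5000
flagged = flag_high_depth_windows(toy)
print(flagged)
```

Windows flagged this way are only candidates: elevated depth can have other causes, so it makes sense to check whether they coincide with the regions where half of the reads map differently.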

Can you attach AssemblySummary.html and stdout.txt for the assembly? Also, if possible, please take a look at the assembly in Bandage. Does it look "fragmented" or "messy"? Would it be possible for you to attach a Bandage screenshot of the assembly? Ideally, it would also be great if you could highlight in that plot some of the regions that you believe may have missing copies, or otherwise list some of those regions. Alternatively, you could attach Assembly-BothStrands-NoSequence.gfa (ideally listing some of the regions with missing copies), and I can take a look myself.
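For a quick numeric impression before opening Bandage, the GFA file can also be summarized with a few lines of Python. This is a hedged sketch based only on the generic GFA 1 layout (tab-separated `S` segment lines and `L` link lines), not on anything specific to Shasta. The helper name `summarize_gfa` is hypothetical, and segment lengths are assumed to be available through an `LN:i:` tag when the sequence field is `*`.

```python
from collections import Counter

def summarize_gfa(lines):
    """Summarize segments and links from an iterable of GFA 1 lines."""
    lengths = {}        # segment name -> length in bases
    degree = Counter()  # segment name -> number of link ends touching it
    n_links = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = 0 if seq == "*" else len(seq)
            # An LN:i: tag, if present, gives the length explicitly
            # (needed when the sequence is omitted as "*").
            for tag in fields[3:]:
                if tag.startswith("LN:i:"):
                    length = int(tag[5:])
            lengths[name] = length
        elif fields[0] == "L" and len(fields) >= 5:
            n_links += 1
            degree[fields[1]] += 1
            degree[fields[3]] += 1
    return {
        "segments": len(lengths),
        "links": n_links,
        "total_length": sum(lengths.values()),
        "isolated_segments": sum(1 for n in lengths if degree[n] == 0),
    }

# Toy graph: a simple bubble (1 -> 2 or 3 -> 4) plus one isolated segment.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\t*\tLN:i:50000",
    "S\t2\t*\tLN:i:1200",
    "S\t3\t*\tLN:i:1300",
    "S\t4\t*\tLN:i:40000",
    "S\t5\t*\tLN:i:800",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
]
summary = summarize_gfa(toy_gfa)
print(summary)
```

To use it on a real file, pass an open file handle, for example `summarize_gfa(open("Assembly-BothStrands-NoSequence.gfa"))`. Many short segments or a high link count relative to the number of segments would be consistent with a "messy" graph, but the counts do not replace looking at the graph itself.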

I am working on new assembly methods in Shasta, still based on the read graph and the marker graph, that should do a much better job assembling hard regions and/or repetitive genomes. This is still work in progress, but it could significantly improve the situation in your case when it becomes available.

paoloshasta commented 8 months ago

I am closing this due to lack of discussion. Feel free to create a new issue if additional discussion topics emerge.

paoloshasta / shasta

Assembly of complex plant regions #18