Closed colindaven closed 11 months ago
Nice! If at some point you have a representative plant R10 dataset that you can share, I could work on creating a new assembly configuration for R10 reads, specialized for plant genomes. I am pretty sure the Nanopore-May-2022 configuration is sub-optimal for R10 reads, even though you are getting a very nice assembly with it.
Ah, this is an R9.4.1 dataset from early 2020, so not R10.4.1 or Q20 I'm afraid. We call them Q10 datasets internally.
If we get access to unrestricted files I'd be happy to pass them on, however this data is likely to stay restricted for some time AFAIK. Maybe we can share some alternative reads from a smaller genome - Arabidopsis - when it becomes available.
That would be nice, and a smaller genome makes things easier when optimizing an assembly configuration.
I am closing this due to lack of additional discussion. Feel free to open a new issue if new topics emerge.
Hello,
I have R9.4.1 plant data basecalled with Guppy 6 in HAC mode. I wonder whether re-basecalling in SUP mode would be worth the trouble, because afterwards I would like to phase with GFAse. Can you offer an opinion, please?
Thank you!
It would definitely be worth the trouble. I have consistently seen a big improvement in quality when switching from the default mode to SUP. I know re-basecalling is a big and painful undertaking, but I strongly suggest that you do it. And the Shasta assembly configurations for phased diploid assembly are optimized for SUP reads, so I don't know how well they would work with HAC reads.
You could also consider switching to R10 reads. I know ONT is trying to switch everybody to R10, but of course it might not be possible for you to regenerate the reads.
Keep in mind that Shasta phased assembly and GFAse are hardwired for diploid genomes. If the plant genome you are working on has higher ploidy, you will probably not get a good assembly.
I am working on new developments in Shasta that should help with higher ploidy, but that work is not ready for prime time.
Thank you. It is diploid. The pseudohaploid drafts are good, with an N50 of 10-11 Mb. I tried UL-phased-may2022, but it is not clear to me whether the result is good. I will post a summary here and then try to repeat the basecalling. I can start a new thread if I have further questions, thank you!
Do you have Ultra-Long (UL) reads? If so, use Nanopore-UL-Phased-Nov2022 (or Nanopore-UL-Phased-May2022). If your reads are not UL reads, use Nanopore-Phased-May2022.
Either way, if you have a lot of coverage, use --Reads.minReadLength or --Reads.desiredCoverage to bring coverage into the range expected by these assembly configurations, that is, around 60x.
Yes, it is best to create a new issue if new discussion topics arise.
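For the coverage adjustment mentioned above, here is a rough Python sketch of the arithmetic, not anything taken from Shasta itself. It assumes you know the approximate genome size and have a list of read lengths. The function names and the 500 Mbp genome size are hypothetical, and the cutoff logic (keep the longest reads until the target is reached) is only an approximation of what an assembler might do internally.

```python
from typing import Iterable


def desired_coverage_bases(genome_size_bp: int, target_depth: float = 60.0) -> int:
    """Total number of bases corresponding to target_depth-fold coverage."""
    return round(genome_size_bp * target_depth)


def min_read_length_for_coverage(read_lengths: Iterable[int], target_bases: int) -> int:
    """Read length cutoff that keeps the longest reads until target_bases is reached.

    Returns 0 (keep everything) if all reads together fall short of the target.
    """
    kept = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        kept += length
        if kept >= target_bases:
            return length
    return 0


if __name__ == "__main__":
    genome_size_bp = 500_000_000  # hypothetical genome size, for illustration only
    target = desired_coverage_bases(genome_size_bp)
    print(f"bases needed for 60x: {target}")

    toy_lengths = [50_000, 40_000, 30_000, 20_000, 10_000]  # toy data
    print(f"cutoff: {min_read_length_for_coverage(toy_lengths, 100_000)}")
```

As far as I recall, --Reads.desiredCoverage expects a number of bases rather than a fold depth, so the first function gives the kind of value you would pass; please check the Shasta command line documentation before relying on that.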
Hi @paracontias, if you are planning to run GFAse, I think the same logic applies: you will want the best possible sequence quality in order to get the best mappings for whichever phasing data type you are using. Let us know in the GFAse repo if you need any help with that.
I am closing this due to lack of discussion. Please create a new issue if additional discussion topics arise.
Hi @paoloshasta
following your suggestions, we re-basecalled an older Q10 dataset (best Shasta assembly 3.7 Mb N50) yesterday and created a new assembly. Nanopore-Plants-Apr2021 was the best config in both cases. The new assembly has a 13 Mb N50 and shows better assembly stats in every way. This is also plant data.
Edit - using the may-2022 config bumped the N50 up to 27 Mb, which is a fantastic improvement.
We're pretty happy with this boost, so thanks for the suggestion.
cheers, Colin