paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.
https://paoloshasta.github.io/shasta/

rebasecalling highly effective #6

Closed colindaven closed 8 months ago

colindaven commented 1 year ago

Hi @paoloshasta

Following your suggestions, we re-basecalled an older Q10 dataset (best Shasta assembly: 3.7 Mb N50) yesterday and created a new assembly. Nanopore-Plants-Apr2021 was the best configuration in both cases.

The new assembly has a 13 Mb N50 and better assembly stats in every respect. This is also plant data.

Edit: using the may-2022 config bumped the N50 up to 27 Mb, which is a fantastic improvement.

We're pretty happy with this boost, so thanks for the suggestion.

cheers, Colin

paoloshasta commented 1 year ago

Nice! If at some point you have a representative plant R10 dataset that you can share, I could work on creating a new assembly configuration for R10 reads, specialized for plant genomes. I am pretty sure the Nanopore-May-2022 configuration is sub-optimal for R10 reads, even though you are getting a very nice assembly with it.

colindaven commented 1 year ago

Ah, this is an R9.4.1 dataset from early 2020, so not R10.4.1 or Q20 I'm afraid. We call them Q10 datasets internally.

If we get access to unrestricted files I'd be happy to pass them on, but this data is likely to stay restricted for some time AFAIK. Maybe we can share some alternative reads from a smaller genome (Arabidopsis) when they become available.

paoloshasta commented 1 year ago

That would be nice, and a smaller genome makes things easier when optimizing an assembly configuration.

paoloshasta commented 1 year ago

I am closing this due to lack of additional discussion. Feel free to open a new issue if new topics emerge.

COMInterop commented 1 year ago

Hello,

I have R9.4.1 plant data basecalled with Guppy 6 in HAC mode. I wonder if re-basecalling in SUP mode would be worth the trouble, because afterwards I would like to phase with GFAse. Could you offer an opinion, please?

Thank you!

paoloshasta commented 1 year ago

It would definitely be worth the trouble. I have consistently seen a big improvement in quality when switching from the default mode to SUP. I know re-basecalling is a big and painful undertaking, but I strongly suggest that you do it. And the Shasta assembly configurations for phased diploid assembly are optimized for SUP reads, so I don't know how well they would work with HAC reads.

You could also consider switching to R10 reads. I know ONT is trying to switch everybody to R10, but of course it might not be possible for you to regenerate the reads.

Keep in mind that Shasta phased assembly and GFAse are hardwired for diploid genomes. If the plant genome you are working on has higher ploidy, you will probably not get a good assembly.

I am working on new developments in Shasta that should help with higher ploidy, but that work is not ready for prime time.

COMInterop commented 1 year ago

Thank you. It is diploid. The pseudohaploid drafts are good, with an N50 of 10-11 Mb. I tried UL-phased-may2022, but I am not sure whether the result is good. I am posting a summary here and will then try repeating the basecalling. I can start a new thread if I have further questions, thank you!

Shasta assembler.pdf

paoloshasta commented 1 year ago

Do you have Ultra-Long (UL) reads? If so, use Nanopore-UL-Phased-Nov2022 (or Nanopore-UL-Phased-May2022). If your reads are not UL reads, use Nanopore-Phased-May2022.

Either way, if you have a lot of coverage use --Reads.minReadLength or --Reads.desiredCoverage to bring coverage in the range expected by these assembly configurations - that is, around 60x.

Yes, it is best to create a new issue if new discussion topics arise.
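
As a rough illustration of the advice above, the following Python sketch picks a configuration name and estimates a read length cutoff that keeps about 60x coverage from the longest reads. The helper names (`read_length_cutoff`, `pick_config`, `shasta_command`), the file name `reads.fastq`, and the genome size estimate are hypothetical, and the cutoff arithmetic is only an assumption about how such a cutoff could be chosen, not a description of what Shasta does internally. In practice `--Reads.desiredCoverage` lets Shasta choose the cutoff itself.

```python
import shlex


def read_length_cutoff(read_lengths, genome_size, target_coverage=60.0):
    """Return a read length cutoff such that reads at least this long
    total roughly target_coverage x genome_size bases.

    Returns 0 when all reads together fall short of the target,
    meaning no cutoff is needed.
    """
    target_bases = target_coverage * genome_size
    kept_bases = 0
    # Keep the longest reads first until the target is reached.
    for length in sorted(read_lengths, reverse=True):
        kept_bases += length
        if kept_bases >= target_bases:
            return length
    return 0


def pick_config(ultra_long):
    """Configuration names as suggested in the comment above."""
    if ultra_long:
        return "Nanopore-UL-Phased-Nov2022"
    return "Nanopore-Phased-May2022"


def shasta_command(reads_path, ultra_long, cutoff):
    """Build a command line string (assumption: run from a shell)."""
    args = ["shasta", "--input", reads_path, "--config", pick_config(ultra_long)]
    if cutoff > 0:
        args += ["--Reads.minReadLength", str(cutoff)]
    return shlex.join(args)


if __name__ == "__main__":
    # Toy numbers only: five reads and a 1-base "genome" at 20x.
    cutoff = read_length_cutoff([10, 8, 6, 4, 2], genome_size=1, target_coverage=20)
    print(shasta_command("reads.fastq", ultra_long=False, cutoff=cutoff))
```

With real data, the read lengths would come from the FASTQ file and the genome size from an independent estimate; the resulting cutoff is only as good as that estimate.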

rlorigro commented 1 year ago

Hi @paracontias, if you are planning to run GFAse, I think the same logic applies: you will want the best possible sequence quality in order to get the best mappings for whichever phasing data type you are using. Let us know in the GFAse repo if you need any help with that.

paoloshasta commented 8 months ago

I am closing this due to lack of discussion. Please create a new issue if additional discussion topics arise.