Unusual performance on R10.4 Dorado data

paoloshasta / shasta

De novo assembly from Oxford Nanopore reads.

https://paoloshasta.github.io/shasta/


Unusual performance on R10.4 Dorado data #24

Closed Mon3trK closed 6 months ago

Mon3trK commented 7 months ago

Dear @paoloshasta, I am assembling a human genome sequenced on an R10.4.1 flow cell and basecalled with Dorado (HAC mode). The input read N50 is decent (~32.0 kb), and so is the sequencing depth (~40-50X). I used the command

shasta-Linux-0.11.1 --input {FASTQ_FILE} --config Nanopore-R10-Fast-Nov2022 --threads 80 --assemblyDirectory {OUTPUT_DIR};

to run the de novo assembly, but I only got an NG50 of ~3 Mb and a total assembled length of 2.7 Gb. I think this is unusually poor for Shasta on this kind of data. Do you have any ideas for improving this assembly? Many thanks.
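For readers unfamiliar with the metric: NG50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assumed genome size (N50 uses half of the total assembled length instead). A minimal sketch of that arithmetic follows. The function name `ng50` and its input, a plain text file with one contig length per line, are assumptions made for illustration and are not something Shasta produces in that form.

```shell
# Hypothetical helper, not part of Shasta.
# Usage: ng50 LENGTHS_FILE GENOME_SIZE
# LENGTHS_FILE holds one contig length (in bases) per line.
ng50() {
    # Sort lengths from longest to shortest, then accumulate until the
    # running total reaches half of the assumed genome size.
    sort -rn "$1" | awk -v g="$2" '
        { sum += $1; if (sum >= g / 2) { print $1; exit } }'
}
```

Passing the total assembled length as the second argument gives N50 instead. If the assembly covers less than half of the assumed genome size, the function prints nothing.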

paoloshasta commented 7 months ago

Please attach the following files from the assembly directory:

AssemblySummary.html
stdout.log
Binned-ReadLengthHistogram.csv
LowHashBucketHistogram.csv
DisjointSetsHistogram.csv
Assembly-BothStrands-NoSequence.gfa

You can omit the last one if it is too big and it is not practical to upload it. With this I should be able to get some idea of what may be going on.
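One way to gather these files into a single archive is sketched below. The function name `collect_shasta_diagnostics`, its arguments, and the optional `yes` switch for skipping the GFA file are assumptions made for illustration; only the file names come from the list above.

```shell
# Hypothetical helper, not part of Shasta.
# Usage: collect_shasta_diagnostics ASSEMBLY_DIR ARCHIVE [yes]
# Pass "yes" as the third argument to leave out the (possibly large) GFA file.
collect_shasta_diagnostics() {
    asm_dir="$1"
    archive="$2"
    skip_gfa="${3:-no}"
    present=""
    for f in AssemblySummary.html stdout.log \
             Binned-ReadLengthHistogram.csv LowHashBucketHistogram.csv \
             DisjointSetsHistogram.csv Assembly-BothStrands-NoSequence.gfa; do
        if [ "$f" = "Assembly-BothStrands-NoSequence.gfa" ] && [ "$skip_gfa" = "yes" ]; then
            continue
        fi
        if [ -f "$asm_dir/$f" ]; then
            present="$present $f"
        else
            echo "missing: $f" >&2
        fi
    done
    # Fail if none of the requested files were found.
    [ -n "$present" ] || return 1
    # The file names contain no spaces, so word splitting is safe here.
    tar -czf "$archive" -C "$asm_dir" $present
}
```

For example, `collect_shasta_diagnostics {OUTPUT_DIR} shasta-diagnostics.tar.gz yes` would bundle everything except the GFA file and report any missing files on standard error.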

Mon3trK commented 7 months ago

@paoloshasta Thanks for the quick response. I would prefer to keep this private. Is there an email address I can send the requested files to?

paoloshasta commented 7 months ago

Sure, you can use the e-mail address in my GitHub profile.

Mon3trK commented 7 months ago

Hi @paoloshasta, I have sent the files by email. If you have any further questions, please let me know.

paoloshasta commented 7 months ago

I received your files and I am looking at them. However, when I try to reply to your e-mail, I get in return an unusual error message from your Outlook server. Please contact me again using a different e-mail address so I can reply. You don't need to resend the files - I have them.

Mon3trK commented 7 months ago

@paoloshasta Thank you for your help. I have sent you an email from a different address.

Mon3trK commented 6 months ago

Thanks to @paoloshasta, I solved the problem. For ONT R10 data assembled with --config Nanopore-R10-Fast-Nov2022, the best performance is achieved when the reads are basecalled with Guppy 5/6 in sup mode. Basecalling in hac mode or with Dorado is suboptimal for this configuration.

colindaven commented 6 months ago

Hi @Mon3trK , for what it's worth that is not generally true for our datasets, and we have done >20 plant genomes.

We use Dorado with sup basecalling as standard, and --config Nanopore-R10-Fast-Nov2022 generally yields the best results. If we see poor results, which is rare these days, we fall back to the Nanopore-May2022 config.
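A rough sketch of running both configurations on the same reads so the results can be compared side by side. The function name `run_configs`, the `SHASTA_BIN` variable, and the one-directory-per-config layout are assumptions made for illustration; the command-line options are the ones used in the original post.

```shell
# Hypothetical helper, not part of Shasta.
# SHASTA_BIN is an assumed variable pointing at the Shasta executable.
SHASTA_BIN="${SHASTA_BIN:-shasta-Linux-0.11.1}"

# Usage: run_configs FASTQ OUTPUT_ROOT THREADS CONFIG [CONFIG...]
# Runs one assembly per config name, each into OUTPUT_ROOT/CONFIG.
run_configs() {
    fastq="$1"
    outroot="$2"
    threads="$3"
    shift 3
    mkdir -p "$outroot" || return 1
    for config in "$@"; do
        "$SHASTA_BIN" --input "$fastq" --config "$config" \
            --threads "$threads" \
            --assemblyDirectory "$outroot/$config" || return 1
    done
}
```

For example, `run_configs {FASTQ_FILE} assemblies 80 Nanopore-R10-Fast-Nov2022 Nanopore-May2022` would leave one assembly directory per config, and the AssemblySummary.html in each can then be compared.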

I believe guppy is deprecated in favour of dorado so it won't be a good choice for the future.

Mon3trK commented 6 months ago

Hi @colindaven thank you for the heads up.