Phased haplotypes - Githubissues

jiadong324 commented 4 weeks ago

Dear author,

Previously, I used hapdup to get a diploid assembly based on Assembly.fasta. Does the new version of Shasta support diploid assembly?

paoloshasta commented 4 weeks ago

Yes, there are several Shasta assembly configurations that request a phased assembly. Which one to use depends on the type of reads you have. See the table on this page, which summarizes the main assembly configurations currently available.

So for example for a phased assembly with the latest high quality ONT reads as announced by ONT in May 2024 (r10.4.1_e8.2-400bps_sup) you could just use one of the following commands:

shasta --input MyReads.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024
shasta --input MyReads.fastq --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024

The first one applies to error-corrected reads using Herro, and the second one applies to raw reads that did not undergo error correction. The input can be either in fastq or fasta format.
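As a minimal sketch of the choice described above, the following Python helper picks between the two configurations and builds the command line. The helper name and its parameters are hypothetical; only the `--input`, `--config` and `--threads` options and the two configuration names come from this thread.

```python
# Hypothetical helper: only the shasta options and configuration names
# below appear in this thread; everything else is illustrative.
HERRO_CONFIG = "Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024"
RAW_CONFIG = "Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024"


def build_shasta_command(reads_path, herro_corrected, threads=None):
    """Return the shasta command as an argument list.

    reads_path      -- fastq or fasta file with the reads
    herro_corrected -- True if the reads were error-corrected with Herro
    threads         -- optional thread count passed to --threads
    """
    config = HERRO_CONFIG if herro_corrected else RAW_CONFIG
    command = ["shasta", "--input", reads_path, "--config", config]
    if threads is not None:
        command += ["--threads", str(threads)]
    return command


if __name__ == "__main__":
    # Print the command for raw (uncorrected) reads.
    print(" ".join(build_shasta_command("MyReads.fastq", herro_corrected=False)))
```

The returned list can be passed to `subprocess.run` if you want to launch the assembly from a script.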

If the information in that table is not sufficient, please feel free to post additional questions here.

jiadong324 commented 2 weeks ago

I used this command: shasta --input R10.fa --config Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 --threads 6, but the output Assembly.fasta is only 3 Gb. I also ran the same sample with NAPU (here is the config for shasta) and the result looks good.

Looking forward to your comments. Thanks!

paoloshasta commented 2 weeks ago

Please attach the following files from your assembly:

AssemblySummary.html
stdout.log

Please also post details of the sequencing and base calling process. The configuration you used applies to high accuracy ultra-long reads using the latest ONT sequencing process as announced by ONT in May 2024.

jiadong324 commented 1 week ago

We used R10 with the LSK114 kit. Dorado v0.8.2 was used for base calling.

Here are the files for the assembly. AssemblySummary.html.zip stdout.log.zip

Thanks for your help!

paoloshasta commented 1 week ago

There are at least two differences between your dataset and the demonstration dataset announced by ONT in May 2024:

Your read N50 is 45 Kb after discarding reads shorter than 10 Kb. In contrast, the read N50 of the ONT dataset was almost 120 Kb after reads shorter than 10 Kb were discarded. I suspect this is because the ONT announcement mentions "SQK-ULK114", which is probably not the same as the LSK114 kit you used. In other words, the ONT dataset had Ultra-Long (UL) reads, and yours does not. I am trying to find more information about that.
In your dataset, after discarding reads shorter than 10 Kb, total coverage was about 78 Gb or about 25x. In contrast, the ONT dataset had almost 120 Gb of coverage after reads shorter than 10 Kb were discarded.
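To check these two numbers on another dataset, here is a rough Python sketch that computes the read N50, total bases and coverage after discarding short reads. The function name is hypothetical, and the default genome size of 3.1 Gb is an assumption (it is the value implied by 78 Gb being about 25x); Shasta reports its own statistics in AssemblySummary.html, and this is not how Shasta computes them.

```python
def read_stats(lengths, min_length=10_000, genome_size=3_100_000_000):
    """Summarize read lengths after discarding reads shorter than min_length.

    genome_size is an ASSUMPTION (roughly a human genome); change it for
    other organisms.
    """
    # Keep only reads that pass the length cutoff, longest first.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)

    # N50: the length L such that reads of length >= L hold at least
    # half of the total bases.
    n50 = 0
    running = 0
    for n in kept:
        running += n
        if 2 * running >= total:
            n50 = n
            break

    return {
        "reads": len(kept),
        "bases": total,
        "n50": n50,
        "coverage": total / genome_size,
    }


if __name__ == "__main__":
    # Toy example; real read lengths would come from the fastq/fasta file.
    print(read_stats([5_000, 10_000, 20_000, 30_000, 40_000]))
```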

The Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024 assembly configuration is optimized for datasets comparable to that ONT dataset, which is why you get an unsatisfactory assembly. Unfortunately we have not yet had the time to create an assembly configuration that works well with shorter reads like yours. However, I expect that your assembly results will improve significantly if you can increase coverage to bring it close to that of the ONT dataset. The phased N50 will be short due to the short reads, but I expect that much more sequence will be assembled diploid.

paoloshasta commented 1 week ago

Yes, after some more research, the kit you used has the same accuracy as the ONT May 2024 dataset, but does not produce Ultra-Long (UL) reads. Is it possible for you to switch your sequencing process to UL reads? But even if you can't do that, just increasing coverage should help.

jiadong324 commented 1 week ago

Thanks for your kind suggestions. We do have some UL ONT data generated on R10 with the ULK114 kit, and I could run more tests on those samples. For my current short-N50 data (R10, LSK114 kit), which configuration file do you recommend?

paoloshasta commented 1 week ago

The one you are currently using, Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024. That should work with shorter reads if you increase coverage to around 35x or more, but even in that case the phased N50 would not be very long, and a significant amount of sequence would still probably be assembled haploid. You could also try the assembly configuration for phased assembly of older R10 reads, Nanopore-Phased-R10-Fast-Nov2022. But your current coverage is probably too low for that configuration too.

We have not yet had the time to create an assembly configuration optimized for your current reads, but in my personal experience a high quality phased assembly with ONT reads requires at least 40x coverage (20x per haplotype).
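As a back-of-the-envelope sketch of what these coverage targets mean in sequenced bases, the following Python snippet estimates how much more data would be needed. The function name is hypothetical, and the 3.1 Gb genome size is an assumption inferred from the figure of 78 Gb being about 25x.

```python
def additional_bases_needed(current_bases, target_coverage,
                            genome_size=3_100_000_000):
    """Bases still to be sequenced to reach target_coverage.

    genome_size is an ASSUMPTION (roughly a human genome).
    Returns 0 if the target is already met.
    """
    required = target_coverage * genome_size
    return max(0, required - current_bases)


if __name__ == "__main__":
    current = 78_000_000_000  # about 78 Gb after discarding reads < 10 Kb
    for target in (35, 40):
        extra = additional_bases_needed(current, target)
        print(f"{target}x needs about {extra / 1e9:.1f} Gb more")
```

Under that genome-size assumption, 40x corresponds to about 124 Gb in total, which is close to the nearly 120 Gb of the ONT dataset mentioned earlier.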

jiadong324 commented 4 days ago

I will try with higher-coverage data. Thanks!

paoloshasta / shasta

Phased haplotypes #40