A genome with extremely rich simple repeats

YouxinZhao commented 3 months ago

The genome size of the species I'm assembling is 2.6 Gb, and the ONT data comes from dorado sup v5.0.0 (R10). When I used 167 GB of data (>Q7, >10 Kb) with "Nanopore-UL-May2022.conf", I got an assembly of 2.2 Gb with a contig N50 of 7.7 Mb.

However, when I used "Nanopore-Phased-R10-Ast-Nov2022.conf", I got an assembly of 3.7 GB with a contig N50 of only 3.0 Mb. After filtering the data to a minimum quality of Q20, I was left with 88 Gb of data. When I used "Nanopore-Phased-R10-Ast-Nov2022.conf" on the filtered data, I got an assembly of 3.6 GB with a contig N50 of 2.7 Mb.
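For readers who want to reproduce this kind of length and quality filtering, here is a minimal sketch. It is not the tool used in this thread (none is named). It assumes well-formed 4-line FASTQ records with Phred+33 qualities, and it computes the mean read quality from per-base error probabilities, which is one common convention. The function names `mean_qscore` and `filter_fastq` are hypothetical.

```python
import math


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Mean Phred quality of a read, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_fastq(lines, min_len=10_000, min_q=20.0):
    """Yield (header, seq, qual) for 4-line FASTQ records passing both thresholds.

    Assumes every record is exactly four lines with no blank lines in between.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        # Length is checked first, so very short reads never reach the quality step.
        if len(seq) >= min_len and mean_qscore(qual) >= min_q:
            yield header.rstrip("\n"), seq, qual
```

An open file handle can be passed as `lines`, for example `filter_fastq(open("reads.fastq"))`, where the file name is a placeholder.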

The genome has extremely rich simple repeats. I also have 60 GB of HiFi data (220 Kb contig N50 for HiFi + UL data using hifiasm), 400 GB of Hi-C data, and trio data. How should I adjust my assembly strategy?

YouxinZhao commented 3 months ago

Should be: "However, when I used "Nanopore-R10-Fast-Nov2022.conf", I got an assembly of 3.7 GB with a contig N50 of only 3.0 Mb. After filtering the data to a minimum quality of Q20, I was left with 88 Gb of data. When I used "Nanopore-R10-Fast-Nov2022.conf" on the filtered data, I got an assembly of 3.6 GB with a contig N50 of 2.7 Mb."

YouxinZhao commented 3 months ago

How should I adjust the Shasta parameters to reduce the assembly size and improve contiguity for R10 data?

paoloshasta commented 3 months ago

Are you using Ultra-Long (UL) reads or regular R10 reads? Are you trying to obtain a phased assembly or just a haploid assembly?

Please attach AssemblySummary.html for the three Shasta assemblies you mentioned. You are at relatively low coverage, and so the haploid N50 of 7.7 Mb may be close to what you can do with these reads for a genome with high repeat content. But I might be able to give more input after I look at those assembly summaries.

YouxinZhao commented 2 months ago

Thank you very much for your reply. The data was Ultra-Long (UL) reads.

HTML_file.zip
132GB data (>Q7, >10Kb) with "Nanopore-UL-May2022.conf"
132GB data (>Q7, >10Kb) "Nanopore-R10-Fast-Nov2022.conf"
83GB data (>Q20, >10Kb) "Nanopore-R10-Fast-Nov2022.conf"
83GB data (>Q20, >10Kb) Phased-R10-Fast-Nov2022.conf
83GB data (>Q20, >10Kb) "Nanopore-UL-May2022.conf"
76GB data (>Q20, >10Kb) HERRO-corrected "Nanopore-UL-May2022.conf"

For this genome, hifiasm (HiFi+ONT+trio), Verkko (HiFi+ONT+trio), and hifiasm+HERRO (ONT) cannot even produce a normal assembly (contig N50 < 500 Kb, a small assembly (500 Mb for Verkko), and a high BUSCO duplicated (D) value). NextDenovo usually takes a month to assemble the genome and gives a smaller assembly and a lower contig N50 than Shasta. Shasta has the best performance.

Our aim is to get a T2T genome for the species. The amount of sequencing data will increase further in the future. However, individuals of this species weigh too little, and the sequencing yield is less than 7G/cell (N50 >50kb). I have another question: can we use individuals from sibling families for a Shasta assembly?

Looking forward to your reply.

paoloshasta commented 2 months ago

Given that these are Ultra-Long (UL) reads, it is not surprising that the best assemblies are the 3 done with the UL-specific assembly configuration (Nanopore-UL-May2022). The N50s you got for those 3 assemblies are 12 Mb, 11 Mb, and 8 Mb. These are respectable numbers, considering that you are working at very low coverage (76 Gb, 83 Gb, and 126 Gb for those three assemblies), and that you are working on a genome with high repeat content.
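For reference, the contig N50 quoted throughout this thread is the contig length at which contigs of that length or longer account for at least half of the total assembly size. A minimal sketch of the calculation follows. The function name `n50` is hypothetical, and this is not how Shasta computes the value reported in AssemblySummary.html, only the standard definition.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Because N50 depends only on the length distribution, a few long contigs can raise it substantially, which is one reason it reacts so strongly to coverage.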

Take a look at the comments at the beginning of that assembly configuration (see here or use shasta --command listConfiguration --config Nanopore-UL-May2022). The comments describing the conditions under which that assembly configuration was tested include:

# - Coverage 60x to 80x.

For your 2.6 Gb genome, the above numbers correspond to coverage between 29x and 48x, well below the value suggested in the comment. And we know that assembly contiguity (N50) is very sensitive to coverage. It is likely that if you increase coverage by a factor of at least 1.5, your assembly contiguity will improve significantly. So doing more sequencing is probably your best bet.
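The coverage arithmetic above can be checked with a short sketch. It assumes coverage is simply total read bases divided by genome size, with both expressed in Gb. The helper name `coverage` is hypothetical.

```python
def coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Approximate sequencing depth: total read bases divided by genome size."""
    return total_bases_gb / genome_size_gb


GENOME_GB = 2.6  # genome size stated in the thread

# Coverage of the three UL assemblies discussed above.
for data_gb in (76, 83, 126):
    print(f"{data_gb} Gb -> {coverage(data_gb, GENOME_GB):.0f}x")

# Data needed to reach the 60x to 80x range quoted in the configuration comments.
for target in (60, 80):
    print(f"{target}x -> {target * GENOME_GB:.0f} Gb")
```

For a 2.6 Gb genome, the 60x to 80x range corresponds to roughly 156 Gb to 208 Gb of reads.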

Shasta does not have functionality to use trio or family information. However, GFAse can use trio information or HiC/PoreC reads to improve the contiguity of Shasta assemblies, and this might be a useful option for you.

As you may already know, Oxford Nanopore has recently announced a new sequencing process that generates UL reads of much better accuracy. We are working on creating Shasta assembly configurations that work well with these new reads. Rather than doing additional sequencing with your existing sequencing pipeline, it may be better for you to switch to this new one.

YouxinZhao commented 2 months ago

The newest sequencing pipeline in China is R10.4.1 with the dorado sup v5.0.0 basecalling model. Do you mean this pipeline? Our ONT data came from this pipeline. I also have another question: could I mix the ONT data and HiFi data when using "Nanopore-UL-May2022.conf"?

paoloshasta commented 2 months ago

I am not familiar with details of the sequencing and basecalling pipelines, but I think the ULK reads announced by ONT in the page I linked above involve sequencing changes in addition to basecalling changes. ONT quotes using the dna_r10.4.1_e8.2-400bps_sup@v5.0.01 basecalling model and read accuracy Q25.65 for these reads.

Mixing ONT and HiFi data in a Shasta assembly is not likely to give satisfactory results. Shasta assumes that the input consists of a homogeneous set of reads, all with the same length and accuracy statistics.

YouxinZhao commented 1 month ago

I have now added 30 Gb of data from the same sequencing pipeline (R10.4.1 and Herro corrected). However, the contig N50 was only 3 Mb to 8 Mb (Nanopore-UL-May2022). When I extracted the previous sequences from the newly corrected sequences, the contig N50 reached 13 Mb.

13 Cells > 10k + 7 Cells 40k (~90 GB)
13 Cells > 10k + 7 Cells 10K (~108 GB)
13 Cells > 10k (~78 GB)

html.zip

paoloshasta commented 1 month ago

Based on my personal experience, N50s in the range you are seeing are typical of the relatively low coverage you are working with. 108 Gb for a 2.6 Gb genome is about 42x, and in my experience you need to be around 60x to 80x to get N50s in the tens of Mb range.
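To turn the coverage target into a sequencing plan, here is a rough sketch. It assumes a yield of 7 Gb per flow cell, which is the upper bound mentioned earlier in the thread, so the result is a lower bound on the number of additional cells. It also ignores data lost to filtering or correction. The function name `extra_cells_needed` is hypothetical.

```python
import math


def extra_cells_needed(current_gb, genome_gb, target_cov, yield_per_cell_gb):
    """Lower bound on additional flow cells needed to reach a target coverage."""
    missing_gb = target_cov * genome_gb - current_gb
    # No extra cells are needed if the target is already met.
    return max(0, math.ceil(missing_gb / yield_per_cell_gb))


# 108 Gb in hand, 2.6 Gb genome, assumed 7 Gb per cell.
for target in (60, 80):
    cells = extra_cells_needed(108, 2.6, target, 7)
    print(f"{target}x: at least {cells} more cells")
```

Under these assumptions, reaching 60x takes at least 7 more cells and reaching 80x at least 15.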

We are working on assembly configurations for the higher quality reads as announced by ONT in May. We will have two assembly configurations optimized for raw reads and Herro-corrected reads respectively. They use the new experimental Shasta Mode 3 assembly (phased diploid assembly), and we hope to release them soon. But even with those configurations low coverage will remain an issue for you.

paoloshasta / shasta

A genome with extremely rich simple repeats #29