small N50 of assembly

csuzhang commented 4 months ago

Hi, I used Shasta v0.11.1 to assemble an R10.4 dataset, and the N50 of the assembly is only 3.4M. Is this right? The dataset was downloaded from s3://ont-open-data/giab_2023.05/analysis/hg002/sup/PAO83395.pass.cram. The parameters used in the assembly: --Assembly.mode2.phasing.minLogP 20 --config Nanopore-Phased-R10-Fast-Nov2022

paoloshasta commented 4 months ago

Please post AssemblySummary.html from your assembly directory.

csuzhang commented 4 months ago

Here is AssemblySummary.html.gz from my assembly directory.

paoloshasta commented 4 months ago

I am assuming that this is a human genome. If otherwise, please let me know the expected genome size, and some of the comments below might not apply.

First let's set expectations by looking at the comments at the beginning of the assembly configuration you are using, Nanopore-Phased-R10-Fast-Nov2022. You can see these comments in shasta/conf/Nanopore-Phased-R10-Fast-Nov2022 or via the following Shasta command:

shasta --command listConfiguration --config Nanopore-Phased-R10-Fast-Nov2022

Those comments include the following:

#  - ONT R10 chemistry, fast mode, chimera rate around 2%.
#  - Guppy 6 basecaller with "super" accuracy.
#  - Human genome at low coverage (HG002, one flowcell, coverage around 35x).
#  - Phased diploid assembly.

# Under these conditions, a test run for HG002 produced an assembly 
# consisting of about 2.7 Gb in bubble chains and 0.3 Gb outside bubble chains.
# Of the 2.7 Gb in bubble chains, about 2.0 Gb were assembled diploid and phased. 

# The N50 for bubble chains was about 5 Mb, and the N50 for phased bubbles was about 0.5 Mb.
# This is longer than most genes, which means that for many genes 
# the assembly contains both haplotypes, entirely phased.

Your assembly has 2.7 Gb in bubble chains, of which 1.8 Gb assembled diploid, so it is similar to the above in terms of amount of sequence assembled. Your N50 for diploid sequence is 0.3 Mb (not 3.4 Mb as in your initial post). This is less than the 0.5 Mb described in the comments, but the same order of magnitude. Your N50 for bubble chains is 4 Mb, also comparable to the 5 Mb described in the comments.

So overall your assembly is of inferior quality compared to what the comments describe, but not substantially inferior.

It is likely that this is close to the best that can be achieved for a phased assembly at this low coverage and with relatively short reads (read N50 of 33 Kb).
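For readers unfamiliar with the metric compared above, here is a minimal Python sketch of how an N50 is commonly computed from a list of sequence lengths. It is an illustration of the standard definition, not Shasta's own code. The function name `n50` and the example lengths are made up for this sketch.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in Mb, for illustration only.
contigs = [8, 6, 4, 2]
print(n50(contigs))  # 6

# Same total sequence, but the longest contig is broken in two:
# the N50 drops even though nothing was lost.
fragmented = [4, 4, 6, 4, 2]
print(n50(fragmented))  # 4
```

The second call shows why N50 is sensitive to breaks: the total amount of assembled sequence is unchanged, yet splitting one long sequence lowers the N50.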

The reason for the low N50 in a phased assembly is that a local coverage drop in one haplotype causes the assembly to temporarily switch to haploid assembly. You can see this if you load your assembly in Bandage.

To obtain a larger N50 you have a few possibilities:

Do a haploid assembly instead of a diploid assembly. For your situation you would use --config Nanopore-R10-Fast-Nov2022.
Use Ultra-Long (UL) reads. UL reads routinely achieve a read N50 around 100 Kb which can be very helpful for assembly.
Switch to the new high accuracy, ultra-long reads recently announced by Oxford Nanopore.

paoloshasta / shasta

small N50 of assembly #28