Use case: nanopore-only genome assembly

nmflack commented 1 year ago

Hello! I'm looking for assistance interpreting an odd use case of Merqury.

We recently assembled a chromosome-level diploid mammal genome with nanopore data only. The assembly has a BUSCO score of 94.7%, N50 > 100 Mb, an L50 of 8, and it aligned well with a closely related species reference. Sequencing coverage was 63x.

However, the read set has a median QV of 14.51, which is obviously lower than Merqury expects and is likely to skew its quality estimate.
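For readers comparing the numbers in this thread, here is a minimal sketch of the standard Phred relationship between a quality value and a per-base error probability (QV = -10 log10 p). The function names are illustrative and not part of Merqury or Inspector. It shows that a median read QV of 14.51 and an error rate of 0.035 describe roughly the same thing.

```python
import math


def phred_to_error(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Median read QV quoted in this post.
    print(f"read QV 14.51 -> error rate {phred_to_error(14.51):.4f}")
    # Read error rate quoted in this post.
    print(f"error rate 0.035 -> QV {error_to_phred(0.035):.2f}")
    # The two assembly-level estimates discussed in this thread.
    for label, qv in [("Merqury (default k)", 45.3), ("Inspector", 31.3)]:
        print(f"{label}: QV {qv} -> error rate {phred_to_error(qv):.2e}")
```

Because the scale is logarithmic, every 10 QV points is a tenfold change in error rate, so the gap between the Merqury and Inspector estimates is larger than the raw QV difference may suggest.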

With default settings, the combined Merqury QV for the assembly was 45.3 with 97.5% completeness. I've included one of the k-mer plots below.

I also ran best_k.sh with our diploid genome size (2.5 Gb x 2 = 5 Gb) and read error rate (0.035), then reran Merqury with the suggested k=19. The result was Q53.2. Here's the output of meryl statistics for that run; I can also grab the k=21 version if that'd be helpful:

Number of 19-mers that are:
  unique            13758263395  (exactly one instance of the kmer is in the input)
  distinct          22872516750  (non-redundant kmer sequences in the input)
  present          157971738696  (...)
  missing          252005390194  (non-redundant kmer sequences not in the input)
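The k=19 suggestion and the statistics above can be sanity-checked with a few lines of arithmetic. This is a sketch under assumptions: `best_k` below is my reading of the formula best_k.sh evaluates, k = log4(G(1 - p) / p), where, as far as I recall, the script describes its second argument p as a tolerable collision rate. Treat both the formula and that interpretation as assumptions, not as a copy of the script.

```python
import math


def best_k(genome_size: float, collision_rate: float = 0.001) -> float:
    """Assumed best_k.sh formula: k = log4(G * (1 - p) / p)."""
    return math.log(genome_size * (1 - collision_rate) / collision_rate, 4)


# Values copied from the meryl statistics output above (k = 19).
K = 19
UNIQUE = 13_758_263_395
DISTINCT = 22_872_516_750
MISSING = 252_005_390_194

if __name__ == "__main__":
    # Diploid genome size (5 Gb) and the 0.035 value passed to best_k.sh.
    k = best_k(5e9, 0.035)
    print(f"raw k = {k:.2f}, rounded up = {math.ceil(k)}")

    # distinct + missing should cover every possible 19-mer, i.e. 4**19.
    print("distinct + missing == 4**19:", DISTINCT + MISSING == 4 ** K)

    # Share of distinct k-mers that occur exactly once.
    print(f"unique / distinct = {UNIQUE / DISTINCT:.3f}")
```

The distinct and missing counts sum exactly to 4^19, which confirms the table is for k=19, and about 60% of the distinct 19-mers occur only once.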

Another tool built for long reads (Inspector) scored the assembly as 97.9% complete with QV 31.3, which I have an easier time believing. Still, I'd like to include an accurate interpretation of our Merqury run in our paper in case others are interested in doing the same.

Would you be willing to share your thoughts on these results? There's a massive number of small k-mers, but it looks like they were largely excluded from the assembly. Homozygosity was high, which could explain the lack of single haplotype k-mers along with switch errors called by Inspector.

Many thanks, Nicole

[Image: Supplemental_kmer (Merqury k-mer plot)]

arangrhie commented 1 year ago

Hello Nicole, Unfortunately, Merqury is not recommended for ONT-only assemblies. The bag of k-mers found in the ONT reads is likely to contain systematic errors, which will inflate the QV. I'd recommend obtaining Illumina reads if possible to further evaluate and polish the genome, if the goal is to build a high-quality reference.

Best, Arang
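To illustrate why shared systematic errors inflate the estimate, here is a toy sketch of a k-mer based QV in the style described in the Merqury paper: assembly k-mers absent from the reads are treated as errors, P = (1 - K_asm_only / K_total)^(1/k), and QV = -10 log10(1 - P). This is not Merqury's code. It uses sets of distinct k-mers on made-up sequences, ignores canonical k-mers and multiplicity, and all names are hypothetical.

```python
import math


def kmers(seq: str, k: int) -> set:
    """Return the set of distinct k-mers in a sequence (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_qv(assembly: str, read_kmers: set, k: int) -> float:
    """Toy k-mer QV: assembly k-mers missing from the reads count as errors."""
    asm = kmers(assembly, k)
    asm_only = asm - read_kmers
    if not asm_only:
        return math.inf  # no k-mer evidence of any error
    p_correct = (1 - len(asm_only) / len(asm)) ** (1 / k)
    return -10 * math.log10(1 - p_correct)


K = 5
TRUTH = "ACGTTGCAAGGCTTACCGATAGGCTA"     # made-up true sequence
ASSEMBLY = "ACGTTGCAAGGCGTACCGATAGGCTA"  # same sequence with one wrong base

# Accurate reads contain only true k-mers, so the error is detectable.
accurate_read_kmers = kmers(TRUTH, K)
# Reads carrying the same systematic error as the assembly also contain
# the erroneous k-mers, so the error becomes invisible.
systematic_read_kmers = kmers(TRUTH, K) | kmers(ASSEMBLY, K)

if __name__ == "__main__":
    print("QV vs accurate reads:  ", kmer_qv(ASSEMBLY, accurate_read_kmers, K))
    print("QV vs systematic reads:", kmer_qv(ASSEMBLY, systematic_read_kmers, K))
```

When the assembly and the read k-mer set come from the same error-prone data, errors they share are never counted, so the reported QV is an upper bound and not an independent measurement.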

nmflack commented 1 year ago

Hi Arang, appreciate the response. That's too bad; ONT is seeing higher mean quality with their new flow cell chemistry, so hopefully things will be different in the future.

marbl / merqury

Use case: nanopore-only genome assembly #92