marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

QV of +inf #58

Closed foreignsand closed 2 years ago

foreignsand commented 2 years ago

I think I must be doing something wrong.

I created a meryl db using the paired reads from an Illumina run and then ran Merqury as described and I'm getting a quality value of +inf. Do you have any suggestions about what I might be doing incorrectly?

$MERQURY/best_k.sh 16599 returns a value of 11.9916; these are mitogenome assemblies, so the genome length is very short.

I then run the following lines of code [I've attached the resulting merged meryl db and my assembly fasta, the original fastq files were too large to be uploaded here]:

meryl k=11.9916 count output OBGP200901_r1.meryl OBGP2009001_R1_trimmed.fastq.gz
meryl k=11.9916 count output OBGP200901_r2.meryl OBGP2009001_R2_trimmed.fastq.gz
meryl union-sum output OBGP2009001_genome.meryl OBGP200901_r*.meryl
$MERQURY/merqury.sh OBGP2009001_genome.meryl OBGP2009001.asmbly.fasta OBGP2009001_eval

The quality value returned in the OBGP2009001_eval.qv file is +inf. This can't be correct.

Many thanks for any and all help.

files.zip

arangrhie commented 2 years ago

Hello @foreignsand ,

The suggested kmer size with best_k.sh is the minimum kmer size that allows kmer collision within the given error rate.

Mitochondrial genomes are relatively easy to assemble without base errors, and I saw few hifi MT assemblies coming with no error kmer in even with k=21. You could be more stringent and try with k=31 and see if you still see Inf. This means there are no base errors found from k=31.

foreignsand commented 2 years ago

Thanks so much, Arang!

I reran with a k of 31 and excluded k-mers with frequency < 100 as you all did in Formenti et al. (2021) and got a QV of 32.8366. In the paper I just mentioned, it states that a QV of 41.25 indicates approximately 1 base calling error per assembly, what does a QV of 32.8366 suggest? This was an assembly from reads that were shotgun-sequenced on an Illumina HiSeq.

Thanks again!

arangrhie commented 2 years ago

I am not sure where you found the frequency < 100 cutoff in Formenti et al. 2021 paper. Could you provide more details? Perhaps sharing your histogram of the 31-mers might be useful to understand what is going on.

Mito genomes usually exist in multiple copies within a cell, so I expect it would appear in very high sequencing depth, but let's check how the profile looks like first. Also, it is important to note that if the Illumina reads were obtained from targeted sequencing and not WGS, the assumption on error profile would be different.

In any case, the QV calculation in Merqury does not require a frequency cutoff as there might be some true low frequency kmers missing unless the sequencing depth was really high. Even if it was high, I would caution to use a frequency filter above 10x.

The *_only.wig track is loadable to UCSC or IGV. You could check where the error kmers are present on your assembly. If there was a misassembly, or base error(s), they usually form a cluster.

Regarding the QV, it is obtained in phred quality score. Q30 translates to 1 error in 1000 bp, Q40 as 1 in 10000 bps.

foreignsand commented 2 years ago

Thanks again for the response! This was where the idea to use frequency < 100 came from: "The tool was run with default parameters and k-mer size 31, but given the approximate 50–100× coverage of VGP datasets, k-mers with frequency < 100 were removed, to limit QV overestimation due to the inclusion of nDNA k-mers."

Oddly, removing the k-mer cutoff doesn't really have an impact on the QV, and I'll exclude or decrease the cutoff as you've advised.

The Illumina reads were from WGS, there was no targeted enrichment or amplification.

I realize that I'm confused about how to evaluate the assembly. Should I be providing the fasta containing all of the assembled contigs from my SPAdes output, or just the contig for the mitogenome? If I pass just the mitogenome contig to Merqury, I still get +inf for a QV and an empty _only.wig file. Passing a fasta containing all assembled contigs produces an _only.wig file where the node that contains the mitogenome sequence—NODE_2 in this case—has no data. I've appended that .wig filename with a .txt extension so I could upload it here:

OBGP2009001.asmbly_only.wig.txt

Perhaps only the contig with the assembly I care about, the mitogenome, should be what I provide to Merqury?

Apologies for my confusion and thanks again for your help!

arangrhie commented 2 years ago

I think now I understood why the <100 filter was applied :) I guess in your case, as it is a WGS, and you are only interested on a subset - the mito - then I think you could apply a more stringent criteria. Given that your <100 cutoff resulted in Q32.8, it seemed to me that the cutoff was too stringent.. In case you haven't evaluate the kmer spectrum yet, see where the nDNA spectrum is, and go from there.

In general, Merqury assumes everything in the sequencing set to be assembled. So providing the entire fasta file and evaluate with the full kmer db will be my recommendation. There is an additional .qv file generated, which contains qvs per each contigs/scaffolds. with Inf+ QV, you can claim that your MT genome has no bp error when measured with k=31 using Merqury. You could apply a more stringent criteria, as above mentioned, and provide as additional supporting evidence for this.

Note that Merqury QV only reflects bp level accuracy, and does not detect structural rearrangements. For those, another examination on the alignment pileup might be necessary.

foreignsand commented 2 years ago

Thanks so much! It's interesting because changing the cutoff value had no effect on the QV. Only the k-mer size. But yes the mitogenome contig still had a +inf QV with cutoff <100 and k of 31 and I found that value in the other qv file you mentioned. I'll use the language you recommended, that's very helpful!

In Geneious I'm mapping reads to the assembled mitogenome I've been using as a guinea pig for this, but if you have any other recommendations for how to evaluate structural rearrangements I'd love your input. There are so many possibilities, but they generally seem to be for evaluating nuclear assemblies and that seems to be overkill for evaluating mitogenomes. Even indexing the reference assembly in bwa and mapping the reads to the reference seems a bit excessive to me as it's so time-consuming and storage-intensive. I just can't seem to find another, easier way to go about it, and I have over 300 mitogenomes to evaluate.

Thanks again!!