marbl / merqury

k-mer based assembly evaluation

The QV value about T2T will be lower #123

Open zongzone opened 2 months ago

zongzone commented 2 months ago

Hi Arang,

I used HiFi data to estimate QV for well-assembled T2T genomes and got values between 53 and 56 (completeness: 93). For other genomes that are not T2T, some estimated QVs were over 70 (completeness: 97). Since ONT data is used to fill gaps when a T2T genome is assembled, does this affect the QV? I have read many of your answers, and one suggested approach is to build a hybrid database (Illumina + HiFi). Are there detailed steps, and does this apply to my case?

arangrhie commented 2 months ago

Hi @zongzone, yes, it's expected to have more errors in the ONT-patched regions. Also, if you evaluate a HiFi assembly with the same HiFi reads, the QV is overestimated because the assembly inherits the same sequencing biases as the reads. I have described how to build hybrid DBs here: https://github.com/arangrhie/T2T-Polish/tree/master/merqury#2-hybrid

The bottom line is to keep the k-mers with frequency over 1 from each dataset and merge them. Below is the one-liner from the above link:

meryl union-sum [ greater-than 1 IlluminaPCRfree.k21.meryl ] [ greater-than 1 hifi20k.k21.meryl ] output hybrid.meryl
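
To make the effect of that one-liner concrete, here is a minimal Python sketch of the same idea on toy data. It is not how meryl is implemented: the function names, the reads, and k=3 are all assumptions chosen for illustration, and unlike meryl it does not collapse a k-mer with its reverse complement.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across a list of read strings (toy: no canonical k-mers)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def greater_than(counts, threshold):
    """Keep k-mers whose count is strictly above threshold (mimics `greater-than N`)."""
    return Counter({kmer: c for kmer, c in counts.items() if c > threshold})

def union_sum(a, b):
    """Keep k-mers present in either input and add their counts (mimics `union-sum`)."""
    return a + b

# Hypothetical toy reads, k=3.
illumina_reads = ["ACGTA", "ACGTA", "CGTAC"]
hifi_reads = ["ACGTT", "ACGTA", "GTACC", "GTACC"]

illumina = greater_than(count_kmers(illumina_reads, 3), 1)
hifi = greater_than(count_kmers(hifi_reads, 3), 1)
hybrid = union_sum(illumina, hifi)

# "GTT" occurs once, in a single HiFi read only, so the filter drops it.
# "TAC" is a singleton in Illumina but seen twice in HiFi, so it is kept.
for kmer in sorted(hybrid):
    print(kmer, hybrid[kmer])
```

In this toy, a k-mer survives if at least one technology saw it more than once, which is why singletons that exist in only one dataset never reach the hybrid database.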

Best, Arang

zongzone commented 2 months ago

Hi Arang, I rebuilt the hybrid database following the commands you gave me and reran the analysis. The QV from the hybrid database is lower than the QV from the HiFi-only database, and I'm not sure whether this is expected.

arangrhie commented 2 months ago

That's expected. A HiFi-only assembly contains k-mers that are seen only in the HiFi reads at low frequency (observed once) and never in Illumina; these are very likely true errors. The unfiltered HiFi k-mers hide them, while the hybrid DB exposes them. Merqury writes track files (look for _only.wig or _only.bed) with the positions of the errors it flags; use them to look at the read alignments at those positions.
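
The difference between the two databases can be sketched in a few lines of Python. This is an illustrative toy, not Merqury's code: the counts, the k-mers, and the helper name `asm_only_kmers` are hypothetical, and it assumes the HiFi-only database is used unfiltered while the hybrid one keeps only k-mers seen more than once per technology.

```python
def asm_only_kmers(assembly_kmers, read_db):
    """Return the assembly k-mers that are absent from the read database."""
    return {kmer for kmer in assembly_kmers if kmer not in read_db}

# Hypothetical k-mer counts (k=4). "GTAG" is supported by a single HiFi read
# and never appears in Illumina, so it is a likely error that the HiFi
# assembly inherited from its own reads.
hifi_counts = {"ACGT": 30, "CGTA": 28, "GTAG": 1, "TAGC": 27}
illumina_counts = {"ACGT": 45, "CGTA": 50, "GTAC": 48, "TAGC": 44}
assembly_kmers = {"ACGT", "CGTA", "GTAG", "TAGC"}

# HiFi-only database: every observed k-mer counts as "present".
hifi_db = set(hifi_counts)

# Hybrid database: keep k-mers seen more than once in either technology.
hybrid_db = {k for k, c in hifi_counts.items() if c > 1} | {
    k for k, c in illumina_counts.items() if c > 1
}

print(asm_only_kmers(assembly_kmers, hifi_db))    # nothing is flagged
print(asm_only_kmers(assembly_kmers, hybrid_db))  # the singleton is flagged
```

More flagged k-mers means a lower QV, so a drop after switching to the hybrid database is consistent with the explanation above.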

LGG02 commented 2 months ago

The bottom line is to use kmers with frequencies over 1, and merge them. Below is the one-liner from the above link. meryl union-sum [ greater-than 1 IlluminaPCRfree.k21.meryl ] [ greater-than 1 hifi20k.k21.meryl ] output hybrid.meryl

@arangrhie Why does this have to be union-sum? Can we use union-max instead? How would that change the downstream results? Thank you.

arangrhie commented 1 month ago

The frequency counts aren't used in the QV estimate in Merqury, so it won't change anything. The QV only looks at the presence or absence of each k-mer.
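
As a rough illustration of why the counts do not matter, the sketch below computes a QV from the number of assembly-only k-mers and the total number of assembly k-mers, following the formula Merqury describes: P = (1 - asm_only / total)^(1/k), E = 1 - P, QV = -10 * log10(E). The function name and the numbers are assumptions for illustration, and the dictionaries merely mimic what union-sum and union-max would produce on toy inputs.

```python
import math

def merqury_style_qv(asm_only, asm_total, k):
    """QV from assembly-only and total assembly k-mer counts."""
    if asm_only == 0:
        return math.inf  # no unsupported k-mers: the error rate estimate is zero
    p_correct = (1 - asm_only / asm_total) ** (1 / k)  # per-base probability of being correct
    error_rate = 1 - p_correct
    return -10 * math.log10(error_rate)

# Hypothetical example: 100 unsupported k-mers out of 1,000,000, k=21.
qv = merqury_style_qv(100, 1_000_000, 21)
print(round(qv, 1))

# Toy databases: union-sum and union-max differ in counts, not in membership.
illumina = {"ACG": 2, "CGT": 3}
hifi = {"ACG": 2, "TAC": 2}
keys = illumina.keys() | hifi.keys()
db_sum = {kmer: illumina.get(kmer, 0) + hifi.get(kmer, 0) for kmer in keys}
db_max = {kmer: max(illumina.get(kmer, 0), hifi.get(kmer, 0)) for kmer in keys}

# Same k-mer set, so the same assembly k-mers are flagged and the QV is identical.
print(set(db_sum) == set(db_max))
```

With these made-up numbers the QV lands in the low 50s, the same range reported at the top of the thread; only the count of flagged k-mers enters the formula, never their frequencies.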