marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

Completeness stats too high, filt count much too high... #37

Closed SheepwormJM closed 3 years ago

SheepwormJM commented 3 years ago

Hi, Thanks again for Merqury, and all your help so far.

I have two different Illumina read sets (one straight sequenced (probably with some PCR) and one with WGA). I ran Merqury with them both together, then separately. The non-WGA completeness stats look sensible to me - about 50% (so, I'm interpreting this as: about 50% of the distinct kmers with counts > the .filt file cut-off present in the read data are also found in the genome assembly). But the WGA completeness stats are much higher - about 86%, yet the spectra plots are very similar. The filter count for the WGA is set at about 237. I re-plotted the data, extending the x-axis to 250 to check if there was something really odd going on - a massive peak at this multiplicity, but nothing there.

So I moved all of my output files into a new directory, copied the .filt file and changed it to 4 (the count cut-off for the non-WGA, when both read databases were used together the cut-off had been 6). And re-ran Merqury. It ran fine, but the completeness stats are empty.

Is there any particular reason why it might be doing this, or any way that I can over-ride it?

This is the spectra plot: image

And this is an extended x-axis one: image

Also, could you please clarify for me - when you say distinct kmers in the read data/genome - this is a kmer sequence. It doesn't matter how many times it is present in the data, it is only counted once? And that if a kmer is present in both the genome assembly and the read database, but at a low multiplicity in the read database (so that it is excluded) it is not included in the assembly total kmers?

Thanks! Jenni

arangrhie commented 3 years ago

Hi Jenni,

What is WGA? Whole genome sequencing reads?

The .filt file gets created with build/filt.sh and is not used downstream.. I kept this as a record. Sorry for the confusion! You can simply run the below to get the completeness:

meryl greater-than $filt $read.meryl output $read_solid
meryl intersect $asm.meryl $read_solid output $asm.solid.meryl

TOTAL=`meryl statistics $read_solid | head -n3 | tail -n1 | awk '{print $2}'`
ASM=`meryl statistics $asm.solid.meryl | head -n3 | tail -n1 | awk '{print $2}'`
echo -e "${asm}\tall\t${ASM}\t${TOTAL}" | awk '{print $0"\t"((100*$3)/$4)}' >> $name.completeness.stats

using your own $filt threshold. asm is your assembly, read your read set. read_solid.. you can name it in your way.

Also, could you please clarify for me - when you say distinct kmers in the read data/genome - this is a kmer sequence. It doesn't matter how many times it is present in the data, it is only counted once? And that if a kmer is present in both the genome assembly and the read database, but at a low multiplicity in the read database (so that it is excluded) it is not included in the assembly total kmers?

Yes, your understanding is correct. The completeness in Merqury is currently not using the multiplicity in the read data, and only uses the distinct kmers. Using the copy number estimate is a future work, and is in development here but still in its early stage. I'd recommend to try the above script and use that for your completeness metric until we have a stable version.

Thanks, Arang

SheepwormJM commented 3 years ago

Thanks very much Arang, I'll give that a go now. The WGA stands for Whole genome amplification. We have very little DNA in our worms. and I think they had felt that not using it hadn't worked very well so did a second worm with WGA. All the best, Jenni

SheepwormJM commented 3 years ago

That's it run perfectly! Thanks Arang!!