Single diploid individual quality estimates and completeness

Overcraft90 commented 2 years ago

Hi arangrhie,

My name is Matteo and I have a question concerning the use of Merqury for a single diploid individual. I'm working with a pure strain (Col-0) of Arabidopsis thaliana, for which I have both high-coverage HiFi data and Hi-C information.

Once the db.meryl has been created from the HiFi reads, is it correct to run Merqury on the .fasta files of the partially-phased haplotype assemblies, which were built from the same HiFi read set with a long-read assembly tool (e.g. hifiasm)? Or should I use the .fasta files of the fully-phased haplotypes generated by integrating the Hi-C information?

I'm asking because the plots I got look a bit weird to me... Thanks in advance!

arangrhie commented 2 years ago

Hi Matteo,

If HiFi reads are the only source of WGS you have, yes, create a db.meryl from them and use it as the reads.meryl. Ideally, it is recommended to use Illumina WGS, as HiFi reads have biases in certain sequence patterns and will therefore favor HiFi assemblies.

You could run Merqury twice:

1. on the partially-phased haplotype assemblies
2. on the fully-phased assembly

However, I wouldn't expect much difference, because no parental data are available to confirm the phasing state. Scaffolding with Hi-C wouldn't affect the kmer spectrum unless a sequence-level modification was performed, such as gap filling or polishing.
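To make the point about scaffolding concrete, here is a toy sketch in Python (not Merqury or meryl code). It assumes, as k-mer counters commonly do, that k-mers are counted in canonical form and that windows containing N are skipped. The contig sequences and the value of k are made up for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = set("ACGT")


def revcomp(seq):
    """Reverse complement; characters other than A/C/G/T are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(seqs, k):
    """Count canonical k-mers, skipping any window with a non-ACGT base."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= BASES:
                counts[min(kmer, revcomp(kmer))] += 1
    return counts


k = 5  # toy value, far smaller than what is used on real genomes
contigs = ["ACGTTGCAAGGCT", "TTGACCGTAGGA"]  # hypothetical contigs

# Scaffolding: join the contigs with a gap of Ns and flip the second one.
scaffold = contigs[0] + "N" * 10 + revcomp(contigs[1])

# A sequence-level change (as polishing or gap filling would make):
# one base of the second contig is substituted.
polished = [contigs[0], "TTGACAGTAGGA"]

before = canonical_kmers(contigs, k)
after_scaffolding = canonical_kmers([scaffold], k)
after_polishing = canonical_kmers(polished, k)

print("scaffolding changes k-mers:", before != after_scaffolding)
print("polishing changes k-mers:  ", before != after_polishing)
```

Joining and re-orienting contigs leaves the k-mer counts untouched, while a single substituted base replaces every k-mer that overlaps it.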

How have you run Merqury? What do the plots look like?

Overcraft90 commented 2 years ago

Thanks a lot for the fast answer!

I had exactly the same thought; in fact, I was discussing this with one of my colleagues, explaining to him that what I'm getting is most likely a consequence of lacking parental data (trio binning).

I ran Merqury with the partially-phased haplotypes (1); the output plots follow:

Merqury-HiFi

And with the fully-phased haplotypes (2), sorry for my bad memory. The output plots follow:

Merqury-Hi-C

I also got the line plots for both the spectra-cn and the spectra-asm (for both hap1 and hap2). However, I also get six additional plots (three for hap1 and three for hap2), and I cannot really understand what they are telling me. For now I am showing only these two figures; I can attach the other six graphs later (just to keep things a bit cleaner).

P.S. column 1 = hap1, column 2 = hap2 and column 3 = assembly spectra

arangrhie commented 2 years ago

The generated plots should be named similarly to this:

nontrio_both.hap1.spectra-cn.fl.png
nontrio_both.hap1.spectra-cn.ln.png
nontrio_both.hap1.spectra-cn.st.png
nontrio_both.hap2.spectra-cn.fl.png
nontrio_both.hap2.spectra-cn.ln.png
nontrio_both.hap2.spectra-cn.st.png
nontrio_both.spectra-asm.fl.png
nontrio_both.spectra-asm.ln.png
nontrio_both.spectra-asm.st.png
nontrio_both.spectra-cn.fl.png
nontrio_both.spectra-cn.ln.png
nontrio_both.spectra-cn.st.png

Plotting type

fl: un-stacked filled plot
ln: un-stacked line plot
st: stacked filled plot

What to plot

spectra-cn: copy number spectrum; show portion of kmers present in 1~4 or more copies in the given fasta
spectra-asm: assembly spectrum; show portion of kmers exclusively belonging to each fasta, or shared by both

And the same plots are generated for each fasta file (hap1 or hap2). Plots without hap1 or hap2 in the name are drawn for both fasta files.
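As a small illustration of this naming scheme, the Python sketch below splits a plot file name into its parts. It is not part of Merqury; parse_plot_name and describe are hypothetical helpers, and the sketch assumes the output prefix itself contains no dots.

```python
from typing import NamedTuple, Optional

PLOT_TYPES = {
    "fl": "un-stacked filled plot",
    "ln": "un-stacked line plot",
    "st": "stacked filled plot",
}
SPECTRA = {
    "spectra-cn": "copy number spectrum",
    "spectra-asm": "assembly spectrum",
}


class PlotName(NamedTuple):
    prefix: str
    assembly: Optional[str]  # None when the plot is drawn for both fasta files
    spectrum: str
    plot_type: str


def parse_plot_name(filename: str) -> PlotName:
    """Split e.g. 'nontrio_both.hap1.spectra-cn.fl.png' into its fields."""
    parts = filename.split(".")
    if len(parts) < 4 or parts[-1] != "png":
        raise ValueError(f"unexpected plot file name: {filename}")
    spectrum, plot_type = parts[-3], parts[-2]
    if spectrum not in SPECTRA or plot_type not in PLOT_TYPES:
        raise ValueError(f"unknown spectrum or plot type in: {filename}")
    # Assumption: the prefix has no dots, so anything between the prefix
    # and the spectrum name is the per-assembly label (hap1, hap2, ...).
    prefix = parts[0]
    assembly = ".".join(parts[1:-3]) or None
    return PlotName(prefix, assembly, spectrum, plot_type)


def describe(filename: str) -> str:
    """One-line human-readable description of a plot file."""
    p = parse_plot_name(filename)
    target = p.assembly or "both fasta files"
    return f"{SPECTRA[p.spectrum]}, {PLOT_TYPES[p.plot_type]}, for {target}"


for name in ("nontrio_both.hap1.spectra-cn.ln.png",
             "nontrio_both.spectra-asm.st.png"):
    print(name, "->", describe(name))
```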

I don't see much 'weirdness' around. Seems like the Hi-C version is in a 'diploid' assembly form? It certainly has fewer missing kmers (read-only, black area) in hap1 compared to the pseudo-haplotype version.
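For intuition about what the spectra-cn and spectra-asm categories mean, here is a toy Python sketch. It is not how Merqury is implemented: it counts forward-strand k-mers only (a simplification), and the sequences, the reads, and k are invented. Each read k-mer is labelled by its copy number in the assembly (spectra-cn) or by which fasta contains it (spectra-asm), then tallied by its multiplicity in the reads.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count forward-strand k-mers over a list of sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def spectra_cn(read_counts, asm_counts):
    """hist[label][read multiplicity] = number of distinct k-mers,
    where label is the k-mer's copy number in the assembly."""
    hist = defaultdict(Counter)
    for kmer, mult in read_counts.items():
        copies = asm_counts.get(kmer, 0)
        if copies == 0:
            label = "read-only"
        elif copies > 4:
            label = ">4"
        else:
            label = str(copies)
        hist[label][mult] += 1
    return hist


def spectra_asm(read_counts, hap1_counts, hap2_counts):
    """Same tally, labelled by which fasta the k-mer belongs to."""
    hist = defaultdict(Counter)
    for kmer, mult in read_counts.items():
        in1, in2 = kmer in hap1_counts, kmer in hap2_counts
        if in1 and in2:
            label = "shared"
        elif in1:
            label = "hap1-only"
        elif in2:
            label = "hap2-only"
        else:
            label = "read-only"
        hist[label][mult] += 1
    return hist


k = 4
hap1 = "ACGTACGGA"  # toy haplotypes differing at one base
hap2 = "ACGTTCGGA"
# Five error-free reads per haplotype plus one read carrying an error.
reads = [hap1] * 5 + [hap2] * 5 + ["ACGTAAGG"]

read_counts = count_kmers(reads, k)
cn = spectra_cn(read_counts, count_kmers([hap1, hap2], k))
asm = spectra_asm(read_counts, count_kmers([hap1], k), count_kmers([hap2], k))

for label, hist in sorted(asm.items()):
    print(label, dict(hist))
```

In this toy data the haplotype-specific k-mers sit at roughly half the multiplicity of the shared ones, and the read-only k-mers with multiplicity 1 come from the erroneous read. Read-only k-mers at higher multiplicity would instead point to sequence missing from the assembly.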

Overcraft90 commented 2 years ago

Yes, correct! Also the Hi-C version is for a diploid assembly.

Thanks for the valuable info, and for the interpretation of Merqury's output. What I meant when I mentioned the other three plots for each haplotype was exactly this:

nontrio_both.spectra-cn.fl.png
nontrio_both.spectra-cn.ln.png
nontrio_both.spectra-cn.st.png

Now it's all clear. Thanks again! Matteo

marbl / merqury

Single diploid individual quality estimates and completeness #70