marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

Don't have maternal and paternal sequence data, how to build hap-kmer #111

Open socialhang opened 10 months ago

socialhang commented 10 months ago

Hi,

Due to the nature of our system we are unable to acquire the parents of our sequenced individual, we sequencing a lot of children data. What should I do to build hap-kmer using these children NGS data?

best wishes! hang

arangrhie commented 9 months ago

Hello,

This is an interesting question, not so easy to answer though.

  1. Using imputed phased variants If you have a collapsed assembly from the child (or children), you could potentially re-map the reads to the assembly (assuming it's nicely polished as well, and near-T2T), and phase the variants called from short-reads using long-reads (HiFi or ONT) or long-range reads (Strand-Seq or HiC). There are several tools that help doing the phasing part. Once the het variants are phased, you could make two haplotype consensuses using bcftools consensus.

Then, you could collect kmers from each haplotype assembly, and use the $MERQURY/trio/hapmer.sh with -no-filter option to get pseudo-haplotype specific kmers and use that to evaluate the original collapsed assembly.

It's not guaranteed that a 'hapmer' is consistent across all chromosomes though, and likely be inconsistent between chromosomal arms if lacking long-range information. Lack of HiC coverage could miss a few true heterozygous variants, especially those between blocks in long stretches of homozygosity. Also, there are so many assumptions made here. There may be regions where it is not easy to map with short-reads, therefore the variant calling accuracy will be poor in those region. I'd only use the phase block information and switch error rate results form Merqury, and not put too much on the blob-plots.

If you already have a diploid assembly, it will be counter intuitive to re-use the assembly specific kmers for evaluation, so unfortunately no, I don't have a good answer.

  1. Using population specific kmers If the parents come from different populations (i.e. subsepcies), and there are some genome sequencing reads available for those, you could try using those population data kmers as the 'parental' kmer db. Run $MERQURY/trio/hapmers.sh with each kmers from the population merged with meryl union-sum.

  2. Sex chromosome limited kmers If your genome follows XY or WZ system, you could use the homogametic (e.g. XX) vs. heterogametic (e.g. XY) samples to extract kmers that are only present in the XY samples. This is not perfect, nor ideal to evaluate the whole genome, but should allow to get an estimate of the Y (or W) specific sequences.

Also linking https://github.com/marbl/merqury/issues/6

Best, Arang