marbl / merqury

k-mer based assembly evaluation
Other
275 stars 19 forks source link

Question: Non parental hap-mer identification #6

Open bmansfeld opened 4 years ago

bmansfeld commented 4 years ago

Hello, I'm very excited to test and assess the quality of our phased assembly (FALCON-Unzip -> FALCON-Phase). However due to the nature of our system we are unable to acquire the parents of our sequenced individual. In the preprint you mention:

"Merqury’s methods are extensible to polyploid genomes and it is possible to identify hap-mers using orthogonal datasets (e.g. Hi-C, Strand-seq). To support alternative k-mer classification methods, Merqury is designed to receive any pre-computed hap-mer set as input."

Though this is beyond the scope of Mergury's pipeline could you direct me to methods for creating such a hap-mer set using Hi-C data?

While a little circular (Hi-C was used to phase the genome) It would be great to be able to asses the phasing accuracy using this method.

Thanks, Ben

arangrhie commented 4 years ago

Hi @bmansfeld,

Fabulous question. :) I am very looking forward to have such a tool become available. Unfortunately, I am not aware of any yet available though. I think it's feasible, will need some time to put things together though. Hope a talented person gets inspired and develops such a tool in the meantime.

Will leave this issue open to see if anyone else can chime in.

-Arang

bmansfeld commented 3 years ago

Hey Arang, Hope all is well. I've been toying with this idea and hoped to get some feedback on this. We have a FALCON-Unzip assembly (+purged_dups) which we corrected phase switch errors with FALCON-Phase and Hi-C data.

Basically I aligned raw illumina reads to phase0 and phase1 fastas and classified them classify_by_alignment script here: https://github.com/esrice/trio_binning. This assigned about 20% of the reads to have a better alignment score to one of the phases. So a total of 40% specific.

I converted the bams back to fastqs and then used meryl to count the kmers associated with each of these.

My specific question is if I should first filter these kmer set to be unique (using meryl difference)? or should i use the full set including the shared kmers?

I know this may be a bit circular but I'm trying to define these reads as hapmers to see if classifying by alignment can help identify phase switches and make blob plots with out the real parent data.

Here are the blobs if I A) use meryl difference to identify phase specific kmers before running merqury. image

And B) if I dont use difference : image

Not sure how to interpret. A looks too good to be meaningful? and B) looks too bad to be useful?

If you have any other feed back or things I can try that would be great. If this is really stupid please let me know too. I know it might. I'm thinking about this as a sort of "testing you de novo transcriptome by aligning your reads back to it". If they align well it's kinda what you expect, but if they don't you know something is wrong... Maybe that analogy is not the best for this approach...

Thanks, Ben

arangrhie commented 3 years ago

Hi Ben,

Very interesting question! It took me some time to think about it - and was drowning with other stuffs ... sorry it took me so long to get back.

So what I understand is that you found Illumina reads mapping better to one of the pseudo haplotype assembly, and collected kmers from there. These binned reads aren't globally phased however locally phased, assuming FALCON-Unzip and -Phase did not introduced more errors.

Your attempt A shows that the binning worked, and the kmers uniquely found in the illumina reads are consistent with the assembly used to bin it. I think it is ok to use A not B, to exclude kmers from the homozygous regions for this purpose, however I would be cautious calling them and using them as 'hapmers' for measuring phasing stats with Merqury...

To turn those kmers to hapmers, there are additional steps needed. 1) Exclude reads where a kmer is shared between both sets but belongs to the 'single copy' peak (if any, see below) If there were reads excluded in this step, the binned kmer set has to be re build over in the same process 2) Identify where these kmer exist in the assembly 3) Find reads with linking kmer(s), which the kmers are from the 2-copy region and shared between the homologous sequences from pseudohaplotype 0 and 1 4) Only use binned kmers that has supporting evidence on both sides to the flancing homozygous kmer identified from 3), and has a pairing haplotype kmer

Still, the kmers obtained at last step will be missing some of the true heterozygous kmer that could be used as a hapmer (low sensitivity). It will only give you the set of kmers that belong to one of the haplotypes, and also was found in the assembly. In other words, it won't be able to capture the hapmers not found, either missing from the assembly or not detected. To test the hapmer coverage, the hapmer plot may be able to show how much is missing.

For 1), this is to double check if the kmers binned does not contain overlapping kmers from the haploid peak. Those would happen when a read gets mis-mapped to a haplotype-mixed region of the assembly.

To test this, you could visually check from the hapmers plot and quantify them from the histogram. Let's call the binned kmer sets, containing the shared kmers as ps0.meryl ps1.meryl. And we can call your full illumina kmer set of the child as child.meryl.

Try run $MERQURY/trio/hapmers.sh ps0.meryl ps1.meryl child.meryl. It will generate a plot similar to this, with more black read-only on the first and second peak: image In your plot, the green portion of "shared" in the first peak will indicate how much of phase violation happened. The red and blue will be the sensitivity of the kmers, matching each pseudo-haplotype.

Thanks, Arang

cahuparo commented 6 months ago

Hi @arangrhie,

Building on the conversation started by Ben Mansfeld in Issue #6 regarding hap-mer generation without access to parental genomes, I've adopted a similar strategy using Illumina reads alongside ONT R10 data to construct and evaluate phased genome assemblies. After assembly improvement steps with tools like purged_dups, NextPolish, and RagTag, and phasing with HapDup, I'm now in the process of quality assessment for these "dual" assemblies. This is a diploid genome of highly heterozygous plant pathogen.

Workflow and Issue Description: Inspired by Ben's methodology, my workflow integrates meryl to derive unique hap-mer databases, followed by the generation of blob plots through Merqury. However, much like the issue Ben encountered, the resulting plots are exceptionally clean, leading to doubts about the phasing precision and hap-mer authenticity.

Approach for Hapmer Creation and Blob Plot Generation:

  1. Hap-mer Creation:

    • Used meryl to count k-mers from Illumina data to create a k-mer database.
    • Differentiating k-mers into haplotype-specific sets using meryl difference, producing hap1_unique.meryl and hap2_unique.meryl.
    • Intersecting unique haplotype k-mers with read data using meryl intersect to find overlaps with PE reads, creating a hap1_pe_intersect.meryl (similarly for hap2).
    • Merging the intersected k-mer sets with meryl union-sum to get a combined set of hap-mer databases.
  2. Blob Plot Creation:

    • With the combined hapmer databases, Merqury was run to assess the quality of the dual assemblies.
    • Merqury used the hapmer databases to calculate QV (Quality Value) scores and generate blob plots.
head i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2.qv
i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_1      207900  134365932       40.6539 8.60222e-05
i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_2      203620  134563709       40.7507 8.41261e-05
Both    411520  268929641       40.7021 8.50734e-05

The concern arises when the blob plots show an overly distinct separation of hap-mers, potentially indicating over-filtering or other issues in the hap-mer generation pipeline.

Concern: The unusually distinct separation of hap-mers in the blob plots echoes Ben's "too good to be meaningful" observation, indicating potential over-filtering or errors in the hap-mer derivation.

i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2 hapmers blob

i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2 spectra-asm fl

Request for Input: In the spirit of the insights shared in this issue, I'm reaching out for guidance on deciphering these blob plots and for suggestions on extra validation procedures. I'm particularly interested in any methodological alterations that could mitigate over-filtering or adapt to the unavailability of parental genomes.

Specific Questions:

Thank you for your time and consideration,

Camilo

PS. The plots are raw as they come out so there is some overlap with legends.