marbl / merqury

k-mer based assembly evaluation
Other
280 stars 19 forks source link

Phase #53

Open duhuipeng opened 3 years ago

duhuipeng commented 3 years ago

Dear author I would like to consult where the color is inconsistent in the picture, (the changing contig),What is the generating file that lets me know which contig it is? image image I personally feel, looking from the picture, this switch error is somewhat serious Last question Why in the generated switches.txt,Show only the switch error results of one phase, where the other phase is to see? image

arangrhie commented 3 years ago

Hi @duhuipeng ,

  1. The changing color What you are seeing is the phase block. You can track the switched blocks from *.phased_block.bed. Each column in phased_block.bed is:

    <scaff> <start> <end> <haplotype> <switch_error> <num. switches> <num. markers>

    So in your case, search for Mat_hap2 in the 4th column. That is the locus where it got picked up as hap2.

  2. N plot The N plot is showing the phase block N value relative to the conting N value. The closer the block line (red) goes to contigs (blue), it means the phase block size is closer to the contig size. I think yours is very well phased. You can see how it looks on the absolute relativeness compared to the expected assembly size; the NG scale. Use Rscript $MERQURY/plot/plot_block_N.R --help to see how to generate the NG plots.

  3. switch error rate in one file? You can find both switch error rates with

    cat *.phased_block.stats

    See https://github.com/marbl/merqury/wiki/3.-Phasing-assessment-with-hap-mers#5-phased-block-statistics-and-switch-error-rates for more details.

duhuipeng commented 3 years ago

Dear author Thanks for your reply According to your advices, I looked at the phased_block.bed file,But I now have a point not to understand ,Hope for your help(as follow) image I found hap2 in hap1 but it switch_error and num switches are 0 but why would it divide called hap2? It is based on what divides the hap2, instead of the hap1?

arangrhie commented 3 years ago

The last column here are the num. kmers found from the haplotype defined as in col. 4. So in your example, 1582 hap2 kmers were found, with no hap1 kmers, and 2234 hap2's, with no hap1s. It looks to me the phasing did failed on these smaller pieces and took the wrong path.

-Arang

duhuipeng commented 3 years ago

Dear author I still don't understand some, according to your said The num. markers is the total number of people from the hap2 and not the hap1 But why is the num. switches a 0? Does num. switches refer to the number from hap1 when the num. switches is 0? If so, I would understand . But another point that I don't understand came out again:

If hap2, appears in hap1 and num. switches is 0, then in theory, these hap2 should appear in the second phase, not in the first Do you feel I understand that right? What is the cause for this?

arangrhie commented 3 years ago

Merqury is defining a phase block based on how many hap1 or hap2 kmers are found. Once the conditions are met to be decided that a block belongs to hap1, the block becomes 'hap1 block' (col. 4). The switches are defined from the occurrence of the other haplotype's kmer. So, if a block was defined as 'hap1', the last col. shows how many kmers were seen from 'hap1'. If there were any kmers found from 'hap2', then that will become a 'switch'.

Note the switch here is defined based on a phased block, not the assembly. In a lot of assemblies, especially pseudo-haplotype assemblies, it is hard to make an assumption that a given fasta was assembled from one haplotype. Therefore, Merqury is reporting switch error rate based on what has been defined as a phase block, given how many haplotype-specific kmers are seen.

As your assembly seems already completely phased, it may be an interest to report the hamming error rate, which measures the occurrence of the k-mers from the unexpected haplotype to the other in the entire assembly, not the phase blocks.

It's easy to calculate that from *.hapmers.count. Each columns are:

<fasta> <scaffold/contig> <hap1> <hap2> <total>

So you can grep all lines from the hap1 assembly (given the fasta name in the 1st col.) and count how many switches occurred with SUM(hap2) / SUM(total). For example, if you were running merqury.sh with merqury.sh read.meryl mat.hapmers.meryl pat.hapmers.meryl mat.fasta pat.fasta output,

awk '$1=="mat" {mat_kmer+=$3; pat_kmer+=$4; total+=$NF} END {print mat_kmer"\t"pat_kmer"\t"total"\t"pat_kmer/total}' *.hapmers.count

Will give you maternal-mers paternal-mers total-mers switch-rate.

In most cases though, I feel per-block switch error rate is more important. As shown in your case, there were only two small blocks that were entirely switched. The hamming error rate does not account for 'where' the switches are. It could be more problematic when it is spread all over the assembly versus locally happening in a block. If it's locally happening, as a block, the chance of corrupting a gene from chimeric haplotype joins goes down.

duhuipeng commented 3 years ago

Dear author I still don't quite understand the swtich error in this result(as follow),This is the result of mat phase For example, image,By truth, it belongs to the mother phase, the hap1 Some of it contig were divided to hap2, so at this time should be a switch, in the 1 ,The num. markers of the contig is 1,175,The num. markers of this contig is 1175, this 1175 should belong to hap2, and at this time 59 belongs to hap1, so, 59 / 1175 is it switcher error, Similarly, 2 It is also very good to understand this But 3 doesn't understand .in the 3,The num. markers of the contig is 68,this 68 should belong to hap2,But the num. switches is 0,That is, his switch error should be 0, Why did this fragment of contig appear in hap1, not in hap2? Similarly, the pat phase is the same result image

arangrhie commented 3 years ago

What you understood is correct.

Why did this fragment of contig appear in hap1, not in hap2?

What you are asking is beyond what Merqury does. It could be a base error that accidentally overlapped with an actual haplotype specific kmer. However, the chance for this is very low. Alternatively, it could be a phasing error in the assembly method. Ask the developers of the assembler used why you see these haplotype switches.

Merqury is a tool designed to detect these kinds of switch errors and helps you localize where the error is.

duhuipeng commented 3 years ago

I got it Thanks

duhuipeng commented 3 years ago

I got it Thanks