Low accuracy on simulated reads and overlapping primary alignments

xchang1 commented 1 month ago

Hello,

I am trying to compare the performance of different mappers on long reads, and minigraph has unexpectedly low accuracy. I'm mapping 1 million simulated HiFi and R10 reads to the HPRC v1.0 chm13 minigraph graph. Minigraph is about 1% less accurate than graphaligner on the same graph, minimap2 on chm13, and other mappers on the minigraph-cactus graph. I expected minigraph to perform at least as well as minimap2. Do you see anything wrong with the way I processed the graph or ran minigraph?

In order to get the output gaf to work with the vg tools, I edited the gfa: sed 's/chr([0-9]*|X|Y|M)/CHM13#0#chr\1/g' to change the reference names and sed 's/\ts([0-9]*)\t/\t\1\t/g' to take the s out of the segment names. I then ran minigraph with: minigraph --vc -N 0 -cx lr -t {threads} {input.gfa} {input.fastq} >{output.gaf}
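For readers who want to reproduce the renaming step without sed, here is a minimal Python sketch of the same two substitutions. It assumes the patterns were meant as extended regular expressions (the unescaped parentheses and `|` suggest sed was run in extended mode, which the post does not state). The function name `rewrite_gfa_line` is hypothetical.

```python
import re

# Same patterns as the two sed expressions in the post, read as
# extended regexes (an assumption; the post does not show the sed flags).
CHR_RE = re.compile(r"chr([0-9]*|X|Y|M)")
SEG_RE = re.compile(r"\ts([0-9]*)\t")


def rewrite_gfa_line(line: str) -> str:
    """Prefix reference names with CHM13#0# and drop the leading 's'
    from tab-delimited segment names."""
    line = CHR_RE.sub(r"CHM13#0#chr\1", line)
    return SEG_RE.sub(r"\t\1\t", line)


if __name__ == "__main__":
    # Toy GFA lines, not taken from the HPRC graph.
    print(rewrite_gfa_line("S\ts12\tACGT\tSN:Z:chr1\tSO:i:0"))
    print(rewrite_gfa_line("L\ts1\t+\ts2\t+\t0M"))
```

Like the sed version, the second pattern only matches a segment name that has a tab on both sides, so a name in the last column of a line would be left unchanged.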

This may be an unrelated issue, but I also noticed that minigraph produces multiple primary alignments that sometimes overlap in the read or the graph. I attached an example of such a read (S1_19235.gaf.txt). As far as I understand it, these cannot be chimeric alignments because some of them overlap in the read or the graph. Should some of them be considered secondary alignments?
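One way to find such reads is to group GAF records by read name and report pairs whose query intervals overlap. The sketch below assumes the standard 12 mandatory GAF columns, with query start and end in columns 3 and 4 as a half-open interval. The helper names and the toy records are hypothetical and are not taken from the attached file.

```python
from collections import defaultdict
from itertools import combinations


def parse_gaf_line(line):
    """Pick out the GAF columns needed for an overlap check."""
    f = line.rstrip("\n").split("\t")
    return {
        "read": f[0],
        "qstart": int(f[2]),  # query start, 0-based
        "qend": int(f[3]),    # query end, exclusive
        "path": f[5],
        "mapq": int(f[11]),
    }


def overlap_len(a_start, a_end, b_start, b_end):
    """Overlap length of two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def query_overlaps(lines):
    """Return (read, path_a, path_b, overlap) for every pair of records
    from the same read whose query intervals overlap."""
    by_read = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = parse_gaf_line(line)
            by_read[rec["read"]].append(rec)
    hits = []
    for read, recs in by_read.items():
        for a, b in combinations(recs, 2):
            n = overlap_len(a["qstart"], a["qend"], b["qstart"], b["qend"])
            if n > 0:
                hits.append((read, a["path"], b["path"], n))
    return hits


if __name__ == "__main__":
    toy = [
        "read1\t1000\t0\t600\t+\t>s1>s2\t5000\t100\t700\t590\t600\t60",
        "read1\t1000\t550\t1000\t+\t>s7\t3000\t10\t460\t440\t450\t60",
        "read2\t500\t0\t500\t+\t>s3\t2000\t0\t500\t500\t500\t60",
    ]
    for hit in query_overlaps(toy):
        print(hit)
```

This only checks overlap on the read. Checking overlap in the graph would also need the path coordinates in columns 8 and 9 and a comparison of the nodes each path visits.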

Thanks! Xian

lh3 commented 1 month ago

If you don't want secondary alignments, use the option --secondary=no. Applying -N0 will reduce mapping accuracy. When minimap2 sees -N0, it ignores the option and issues a warning because this is a common mistake. Minigraph doesn't have this mechanism.

How is accuracy evaluated? Around a complex VNTR, minigraph can often align most bases correctly but may choose a few wrong nodes in the middle of the alignment. If you require an exact path match, graphaligner can be more accurate. The minigraph paper mentions this limitation. The latest minigraph alleviates this problem, but the latest graphaligner might be better.
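To illustrate why the evaluation criterion matters here, the following sketch contrasts an exact path match with a looser node-sharing score for two GAF-style paths. It assumes oriented paths of the form `>s1>s2<s3`. The Jaccard score is only a crude illustrative proxy chosen for this example, not a metric used by minigraph, graphaligner, or vg, and the function names are hypothetical.

```python
import re

# One oriented step in a GAF path: '>' or '<' followed by a segment name.
STEP_RE = re.compile(r"([<>])([^<>]+)")


def path_nodes(path):
    """Split '>s1>s2<s3' into [('s1', '>'), ('s2', '>'), ('s3', '<')].
    A stable-coordinate path without '<' or '>' yields an empty list."""
    return [(m.group(2), m.group(1)) for m in STEP_RE.finditer(path)]


def exact_path_match(truth, mapped):
    """Strict criterion: same nodes, same orientations, same order."""
    return path_nodes(truth) == path_nodes(mapped)


def node_jaccard(truth, mapped):
    """Loose criterion: fraction of distinct nodes shared by both paths."""
    a = {name for name, _ in path_nodes(truth)}
    b = {name for name, _ in path_nodes(mapped)}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    truth = ">s1>s2>s3>s4"
    mapped = ">s1>s2>s5>s4"  # one wrong node in the middle
    print(exact_path_match(truth, mapped))
    print(node_jaccard(truth, mapped))
```

With one wrong node in the middle, the strict criterion rejects the alignment outright while the loose one still gives it most of the credit, which is the kind of gap described above.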

Both bwa-mem and minimap2 may also output multiple primary alignments with overlaps, as there is often local homology around a breakpoint. This is the expected behavior.

xchang1 commented 1 month ago

Thanks for the quick response!

I tried running it again with --secondary=no instead of -N0, but the accuracy is still low.

I'm using vg annotate and vg gamcompare to evaluate accuracy. Reads are annotated with reference positions everywhere they overlap the reference paths in the graph. If any of the annotations on the mapped read matches any of the truth annotations on the simulated read, regardless of where they occur on the read, the read is counted as correct.
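The criterion described above can be sketched roughly as follows. This is not the vg implementation: the data layout (lists of `(path, position)` tuples per read) and the `max_dist` tolerance are assumptions made for illustration, and the thread does not say what distance threshold, if any, was used.

```python
def is_correct(mapped, truth, max_dist=0):
    """A read counts as correct if ANY mapped annotation lies on the same
    reference path as ANY truth annotation, within max_dist bases.
    max_dist is a hypothetical tolerance, not a value from the thread."""
    for path, pos in mapped:
        for truth_path, truth_pos in truth:
            if path == truth_path and abs(pos - truth_pos) <= max_dist:
                return True
    return False


def accuracy(reads, max_dist=0):
    """Fraction of reads counted as correct; reads is a list of
    (mapped_annotations, truth_annotations) pairs."""
    if not reads:
        return 0.0
    return sum(is_correct(m, t, max_dist) for m, t in reads) / len(reads)


if __name__ == "__main__":
    # Toy annotations, not real simulation output.
    reads = [
        ([("CHM13#0#chr1", 1000)],
         [("CHM13#0#chr1", 1000), ("CHM13#0#chr1", 1500)]),
        ([("CHM13#0#chr2", 50)], [("CHM13#0#chr1", 50)]),
    ]
    print(accuracy(reads))
```

Because a single matching annotation is enough, this criterion is lenient toward alignments that pick a few wrong nodes, as long as some part of the read lands at a truth position on a reference path.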

lh3 commented 1 month ago

This is inconsistent with my old evaluation on GRCh38. Perhaps most of the wrong mappings come from chm13 centromeres. Minigraph doesn't try hard to align centromeric reads because, on real data, a large fraction of centromeric reads can't be aligned between samples anyway.
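One way to test this hypothesis would be to split accuracy by whether a read's simulated origin falls inside a centromere annotation. The sketch below assumes per-read results are available as `(path, truth_position, correct)` tuples and that centromere intervals are supplied by the user. The coordinates in the example are toy values, not real CHM13 annotations, and the function names are hypothetical.

```python
def in_regions(path, pos, regions):
    """True if pos falls in any half-open [start, end) interval for path."""
    return any(start <= pos < end for start, end in regions.get(path, []))


def stratified_accuracy(results, regions):
    """Accuracy for reads whose truth position is inside vs. outside the
    given regions. Returns None for a stratum that has no reads."""
    counts = {"inside": [0, 0], "outside": [0, 0]}  # [correct, total]
    for path, pos, correct in results:
        key = "inside" if in_regions(path, pos, regions) else "outside"
        counts[key][0] += int(correct)
        counts[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in counts.items()}


if __name__ == "__main__":
    # Toy intervals and results for illustration only.
    regions = {"chrA": [(100, 200)]}
    results = [
        ("chrA", 150, False),
        ("chrA", 160, True),
        ("chrA", 500, True),
        ("chrB", 150, True),
    ]
    print(stratified_accuracy(results, regions))
```

If accuracy outside the centromeres is close to that of the other mappers while accuracy inside is much lower, that would support the explanation given above.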

lh3 / minigraph

Low accuracy on simulated reads and overlapping primary alignments #115