lh3 / minigraph

Sequence-to-graph mapper and graph generator
https://lh3.github.io/minigraph
MIT License
419 stars 38 forks source link

Comparing minigraph and Dipcall #85

Open cjain7 opened 1 year ago

cjain7 commented 1 year ago

Dear Heng,

I and @gsc74 tried to compare minigraph and Dipcall SV calls by using HG002 HiFi diploid assembly. We used a similar method for comparison as how you had used the Syndip benchmark in the minigraph paper. We didn't expect to reproduce the exact same numbers that were presented in the paper because significant improvements to minigraph code have been made since then. Unfortunately we are seeing worse performance with the latest minigraph code when compared to the paper. I am sharing the results and benchmarking methodology here. If you have any advice, please let us know.

Genomes used: Diploid HG002 assembly (HAP1=GCA_018852605.1, HAP2=GCA_018852615.1) REF= GCA_000001405.15_GRCh38_no_alt_analysis_set.fna Minigraph version: 0.20-r559

Total SVs called by minigraph: 24004 Fraction of minigraph SVs that overlap with Dipcall SV calls: 91.2% Total SVs called by Dipcall: 28479 Fraction of Dipcall SVs that overlap with minigraph SV calls: 85.5% But the numbers from paper were 94% and 97.3% respectively for calling SVs of length ≥ 100bp.

Minigraph commands:

minigraph -t 4 -cxggs $REF $HAP1 $HAP2 > minigraph.gfa
gfatools bubble minigraph.gfa > minigraph.bed
cat minigraph.bed | awk '{ print $1"\t"$2"\t"$3}' | sort -k 1,1 -k2,2n > minigraph.SV.bed

Dipcall commands:

run-dipcall -x hs38.PAR.bed prefix $REF $HAP1 $HAP2 > prefix.mak
make -j8 -f prefix.mak
tabix -p vcf prefix.dip.vcf.gz
# convert VCF to BED
zcat prefix.dip.vcf.gz | awk '! /\#/' | awk '{if(length($4) > length($5)) print $1"\t"($2-1)"\t"($2+length($4)-1); else print $1"\t"($2-1)"\t"($2+length($5)-1)}' > dip.vcf.bed
# filter variants of length >=50
cat dip.vcf.bed | awk '{ print $1"\t"$2"\t"$3}' | awk '{if($3-$2 ≥ 50) print}' | sort -k 1,1 -k2,2n > dip.vcf.SV.bed

We checked how many minigraph SVs are supported by Dipcall SVs:

# expand Dipcall SV intervals by 1000 bp on both sides
bedops --range 1000 --everything dip.vcf.SV.bed > dip.vcf.SV.expand.bed
bedtools intersect -a minigraph.SV.bed -b dip.vcf.SV.expand.bed -u | wc -l

Similarly, we checked how many Dipcall SVs are supported by minigraph SVs:

bedops --range 1000 --everything minigraph.SV.bed > minigraph.SV.expand.bed
bedtools intersect -a dip.vcf.SV.bed -b minigraph.SV.expand.bed -u | wc -l

Do you think the above results are okay?

lh3 commented 1 year ago

When calculating the fraction of dipcall SVs overlapping minigraph SVs, I was only considering SVs in confident regions. When I did the reverse, I didn't apply the confident region. For comparing SVs, usually we don't apply a hard cutoff on SV length. For example, if we want to evaluate SVs >=50bp, we typically include "truth SVs" >=30bp instead because SV representations are often different. You will get higher percentages with these changes.

You may still see lower percentages in comparison to the numbers in the paper for two reasons. 1) The paper considers SVs >=100bp. The consistency is usually lower for >=50bp SVs. 2) With base alignment, minigraph now produces smaller bubbles. The differences in SV representation will be amplified. For example, if we produce one bubble per chromosome, we will get a 100% consistency. In theory, it would be better to take SV lengths into account, but it is tricky in practice.

PS: I have looked at graphs generated with more recent versions. The graphs generally look tighter and more precise – overall an improvement.