Results of Normalize Snarls Indicates Inconsistent Measurement of Mapping Accuracy

Robin-Rounthwaite commented 1 year ago

1. What were you trying to do? Normalize Snarls is built to take each snarl, extract their haplotypes, and realign them using sPOA to generate a new and hopefully improved representation of variation.

Running this process has revealed multiple cases where two apparently identical mappings of the same read (one in normalized, one in unnormalized) are marked as different levels of accuracy. I'll use read seed_12345_fragment_110250 as an example

2. What did you want to happen? To map read seed_12345_fragment_110250_1 in the normalized and unnormalized graphs, and get the same response with regards to accuracy. (i.e. both True or both False for the mapping accuracy).

3. What actually happened? seed_12345_fragment_110250_1 is mapped as accurate in the unnormalized graph, but not the normalized graph. But both have the same reference mapping position, the same mapping score, and the same mapping quality of 60. This suggests that these two mappings are effectively identical. (seed_12345_fragment_110250_2 remained unaffected - normalized and unnormalized mappings are marked as accurate.)

Unnormalized mapping output (gam of a single read pair): s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/seed_12345_fragment_110250.unnormalized.gam

Normalized mapping output (gam of a single read pair): s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/seed_12345_fragment_110250.normalized.gam

truth-set gam (gam of a single read pair): s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/seed_12345_fragment_110250.accuracy_drops_for_identical_mappings.true-pos.gam

unnormalized graph: s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/graph.vg

normalized graph: s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/graph.spoa.normalized.vg

You can also find the svgs here: s3://vg-k8s/users/rrounthw/normalize_snarls/report-giraffe-bug/seed_12345_fragment_110250-svgs/

Example commands to run mapping (where BASE is the name of the files without the extension):

vg convert -x ${BASE}.vg > ${BASE}.xg

# #get gbwt, gg, min, dist:

vg index -j ${BASE}.dist ${BASE}.xg

vg gbwt -g ${BASE}.sampled.gbwt.gg -x ${BASE}.xg ${BASE}.sampled.gbwt

vg minimizer -g ${BASE}.sampled.gbwt -d ${BASE}.dist -o ${BASE}.min ${BASE}.sampled.gbwt.gg 

vg giraffe -M 10 -x ${BASE}.xg -H ${BASE}.sampled.gbwt -g ${BASE}.sampled.gbwt.gg -m ${BASE}.min -d ${BASE}.dist -G ${SIM_READS} -b default -i -t 22 > ${OUTPUT_DIR}/${METADATA}.${BASE}.mapped.gam

READS=novaseq6000
GBWT=sampled
PARAM_PRESET=default
PAIRING=single

echo GRAPH ${GRAPH} GBWT ${GBWT} READS ${READS} PARAM_PRESET ${PARAM_PRESET} PAIRING ${PAIRING} SPEED ${SPEED} CORRECT_COUNT ${CORRECT_COUNT} MAPQ ${MAPQ} MAPQ60 ${MAPQ60} IDENTITY ${IDENTITY} SCORE ${SCORE}

vg gamcompare -r   100 -s <(vg annotate -m -x ${BASE}.xg -a ${OUTPUT_DIR}/${METADATA}.${BASE}.mapped.gam) ${SIM_READS} 2>${OUTPUT_DIR}/${METADATA}.${BASE}.count | vg view -aj - > ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json

CORRECT_COUNT="$(sed -n '1p' ${OUTPUT_DIR}/${METADATA}.${BASE}.count | sed 's/[^0-9]//g')"
SCORE="$(sed -n '2p' ${OUTPUT_DIR}/${METADATA}.${BASE}.count | sed 's/[^0-9\.]//g')"
MAPQ="$(grep mapping_quality\":\ 60 ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json | wc -l)"
MAPQ60="$(grep -v correctly_mapped ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json | grep mapping_quality\":\ 60 | wc -l)"
IDENTITY="$(jq '.identity' ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json | awk '{sum+=$1} END {print sum/NR}')"
echo GRAPH ${GRAPH} GBWT ${GBWT} READS ${READS} PARAM_PRESET ${PARAM_PRESET} PAIRING ${PAIRING} SPEED ${SPEED} CORRECT_COUNT ${CORRECT_COUNT} MAPQ ${MAPQ} MAPQ60 ${MAPQ60} IDENTITY ${IDENTITY} SCORE ${SCORE}
printf "${GRAPH}\t${GBWT}\t${READS}\t${PARAM_PRESET}\t${PAIRING}\t${SPEED}\t${CORRECT_COUNT}\t${MAPQ}\t${MAPQ60}\t${IDENTITY}\t${SCORE}\n" >> ${REPORT_TSV}
# jq -r '(if .correctly_mapped then 1 else 0 end|tostring) + "," + (.mapping_quality|tostring) + "," + (.score|tostring)' ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json | sed 's/,/\t/g' | sed "s/$/\tgiraffe_${PARAM_PRESET}_${GRAPH}${GBWT}${READS}${PAIRING}/" >> ${ROC_STATS_TSV}
jq -r '(if .correctly_mapped then 1 else 0 end|tostring) + "," + (.mapping_quality|tostring) + "," + (.score|tostring)' ${OUTPUT_DIR}/${METADATA}.${BASE}.compared.json | sed 's/,/\t/g' | sed "s/$/\t${GRAPH}/" >> ${ROC_STATS_TSV}

6. What does running vg version say?

vg version v1.44.0-208-g319d08485 "Solara"
Compiled with g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0 on Linux
Linked against libstd++ 20220421
Built by robin@robin-Zephyrus

glennhickey commented 1 year ago

I think the problem here is that vg annotate is annotating all the _alt_ paths in the graph. You have these "alt" path positions in your truth that are also in the original graph and it's able to find a match with those paths. Your normal graph have no alt paths and therefore cannot make the same match.

You can verify this by dropping the alt paths from your original graph vg paths -v graph.vg -q _alt -d > graph.clean.vg and you will get the same result as your normalized graph.

I'm not sure the context of your comparison, but you may want to make sure you're only looking at annotations of real paths?

It may be a good idea to make vg annotate ignore alt paths by default, but maybe there's a use case for them....

Robin-Rounthwaite commented 1 year ago

After discussion in meeting, looks like the consensus is: If I want to use vg annotate, I need to make sure that both graphs have the same alt paths. Good to know!

vgteam / vg

Results of Normalize Snarls Indicates Inconsistent Measurement of Mapping Accuracy #3860