human-pangenomics / hpp_pangenome_resources

94 stars 3 forks source link

Inconsistencies in path lengths and count between construction methods #26

Closed dubssieg closed 4 months ago

dubssieg commented 4 months ago

Hello,

[!NOTE]\ All comparisons I made here are on the chr21, between the PGGB graph and the .full MGC one build with GRCh38 as reference.

I was wondering about the completeness of the MGC graphs. I wanted to analyse differences between mgc and pggb graphs, but noticed two things that are weird to me. First, the two graphs does not share the same number of paths. For instance, I have 981 paths in the MGC GRCh38 chr21 .full graph and 3029 in the PGGB one. If I intersect those (based on name of paths), I get the following results:

venndiag

I investigated next on the lengths of the sequences, and I will take as example HG02886#2#JAHAOT010000029.1 (on chr21). When extracting the sequence from the .agc file, its length is 41271147. The path corresponding to this sequence is 41271147 length in the PGGB graph (same as the input sequence). However, in the .full graph from MGC, its length is 41999755.

As far as I know, the .full graph is non-clipped and non-filtered (equivalent of --clip 0 --filter 0 in cactus-graphmap-join). But how does MGC adds data to the sequences? Does this comes from the merging of multiple input sequences into one, explaining the lowest number of paths? If I sum the lengths of every path in the graph, I get 3535614369 for the mgc one and 4012006987 for the pggb one.

Here are some example values I gathered, on some of the longest sequences of the chr21. It happens on every path of the MGC graph.

sample agc mgc.full (GRCh38) pggb
HG03098#2#JAHEPL010000022.1 35841889 36505254 (+663365) 35841889 (+0)
HG02486#2#JAGYVL010000061.1 35884607 36536905 (+652298) 35884607 (+0)
NA18906#2#JAHEON010000028.1 35161309 35772290 (+610981) 35161309 (+0)
HG03540#2#JAGYVX010000031.1 32982506 33565599 (+583093) 32982506 (+0)
HG02886#2#JAHAOT010000029.1 41271147 41999755 (+728608) 41271147 (+0)
HG01928#1#JAGYVQ010000034.1 33093239 33678959 (+585720) 33093239 (+0)
HG02723#2#JAHEOT010000048.1 34255374 34907469 (+652095) 34255374 (+0)
HG00673#1#JAHBBZ010000060.1 33176438 33760530 (+584092) 33176438 (+0)
HG01106#2#JAHAMB010000053.1 38091486 38736499 (+645013) 38091486 (+0)
HG01358#1#JAGYZB010000035.1 37543720 38205414 (+661694) 37543720 (+0)
HG03516#2#JAGYYS010000033.1 33103774 33691371 (+587597) 33103774 (+0)
HG00741#1#JAHALY010000076.1 33794583 34424667 (+630084) 33794583 (+0)
NA18906#1#JAHEOO010000072.1 40696337 41402469 (+706132) 40696337 (+0)

I was wondering if you had any explanation about this. I tried to find clues in the readme, other issues as well as the publication; I might have missed it however. Anyways, thanks in advance for any answer, even partial.

glennhickey commented 4 months ago

The MC and PGGB graphs were both constructed by assigning contigs to reference chromosomes, then constructing a graph for each chromosome. But the chromosome assignment logic was different between the two methods, which is why you see the difference in your venn diagram. This difference is going to be largest on accrocentric chromosomes like 21 (especially when considering the GRCh38-based mc graph, which will cover much less of 21's short arm due to it being a gap in GRCh38). The thousands of contigs difference you are seeing are mostly tiny fragments of rdna and centromere and probably do not align very confidently anywhere.

That said, the contigs themselves are identical between the two graphs. Using your example HG02886#2#JAHAOT010000029.1:

wget -q  https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.chroms/chr21.full.og
odgi paths -i chr21.full.og -lL | grep JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0   1   41271147

Which can be verified in the various other indexes

zcat hprc-v1.1-mc-grch38.full.gfa.gz  | grep JAHAOT010000029 | grep HG02886
W   HG02886 2   JAHAOT010000029.1   0   41271147    >42817236> ...

vg paths -Ev hprc-v1.1-mc-grch38.full.gbz -Q HG02886#2#JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0   41271147

halStats hprc-v1.1-mc-grch38.full.hal --sequenceStats HG02886.2 | grep JAHAOT010000029
JAHAOT010000029.1, 41271147, 182945, 0
dubssieg commented 4 months ago

Thanks a lot for your quick reply, that explains most of the problems I had.

I was able to replicate on my end your results.

vg paths -Ev chr21_mgc_grch38.gfa -Q HG02886#2#JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0   41271147

To my knowledge, MC uses walks to keep trace of the contigs. Taken from the GFA-spec readme:

Walk: (since v1.1) an ordered list of oriented segments, intended for pangenome use cases. Each consecutive pair of oriented segments must correspond to a 0-overlap link record.

If I understand this well, as each consecutive pair corresponds to a 0-overlap link, I can take every segment described in a path, and sum their lengths to get the length of the path; hence the representation of the contig in the graph. This is how I computed the lengths of each contig that I put in the table. I verified, all overlaps in the file are 0M.

Is my understanding of the GFA format wrong here? Does the paths can feature certain nodes, but not cross them? Sorry for all the questions.

dubssieg commented 4 months ago

All my apologies, was a mistake in my parser. Thanks for all the information. Have a nice day.