Closed dubssieg closed 4 months ago
The MC and PGGB graphs were both constructed by assigning contigs to reference chromosomes, then constructing a graph for each chromosome. But the chromosome assignment logic was different between the two methods, which is why you see the difference in your venn diagram. This difference is going to be largest on accrocentric chromosomes like 21 (especially when considering the GRCh38-based mc graph, which will cover much less of 21's short arm due to it being a gap in GRCh38). The thousands of contigs difference you are seeing are mostly tiny fragments of rdna and centromere and probably do not align very confidently anywhere.
That said, the contigs themselves are identical between the two graphs. Using your example HG02886#2#JAHAOT010000029.1
:
wget -q https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.chroms/chr21.full.og
odgi paths -i chr21.full.og -lL | grep JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0 1 41271147
Which can be verified in the various other indexes
zcat hprc-v1.1-mc-grch38.full.gfa.gz | grep JAHAOT010000029 | grep HG02886
W HG02886 2 JAHAOT010000029.1 0 41271147 >42817236> ...
vg paths -Ev hprc-v1.1-mc-grch38.full.gbz -Q HG02886#2#JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0 41271147
halStats hprc-v1.1-mc-grch38.full.hal --sequenceStats HG02886.2 | grep JAHAOT010000029
JAHAOT010000029.1, 41271147, 182945, 0
Thanks a lot for your quick reply, that explains most of the problems I had.
I was able to replicate on my end your results.
vg paths -Ev chr21_mgc_grch38.gfa -Q HG02886#2#JAHAOT010000029
HG02886#2#JAHAOT010000029.1#0 41271147
To my knowledge, MC uses walks to keep trace of the contigs. Taken from the GFA-spec readme:
Walk: (since v1.1) an ordered list of oriented segments, intended for pangenome use cases. Each consecutive pair of oriented segments must correspond to a 0-overlap link record.
If I understand this well, as each consecutive pair corresponds to a 0-overlap link, I can take every segment described in a path, and sum their lengths to get the length of the path; hence the representation of the contig in the graph.
This is how I computed the lengths of each contig that I put in the table. I verified, all overlaps in the file are 0M
.
Is my understanding of the GFA format wrong here? Does the paths can feature certain nodes, but not cross them? Sorry for all the questions.
All my apologies, was a mistake in my parser. Thanks for all the information. Have a nice day.
Hello,
I was wondering about the completeness of the MGC graphs. I wanted to analyse differences between mgc and pggb graphs, but noticed two things that are weird to me. First, the two graphs does not share the same number of paths. For instance, I have 981 paths in the MGC GRCh38 chr21
.full
graph and 3029 in the PGGB one. If I intersect those (based on name of paths), I get the following results:I investigated next on the lengths of the sequences, and I will take as example HG02886#2#JAHAOT010000029.1 (on chr21). When extracting the sequence from the
.agc
file, its length is 41271147. The path corresponding to this sequence is 41271147 length in the PGGB graph (same as the input sequence). However, in the.full
graph from MGC, its length is 41999755.As far as I know, the
.full
graph is non-clipped and non-filtered (equivalent of--clip 0 --filter 0
incactus-graphmap-join
). But how does MGC adds data to the sequences? Does this comes from the merging of multiple input sequences into one, explaining the lowest number of paths? If I sum the lengths of every path in the graph, I get 3535614369 for the mgc one and 4012006987 for the pggb one.Here are some example values I gathered, on some of the longest sequences of the chr21. It happens on every path of the MGC graph.
I was wondering if you had any explanation about this. I tried to find clues in the readme, other issues as well as the publication; I might have missed it however. Anyways, thanks in advance for any answer, even partial.