Closed sivico26 closed 1 year ago
Hi Sivico, could you reduce your GFA so that it contains just one path that triggers your issue? Compressed with zstd, it might become small enough to be 'shareable'.
Also, to be 100% sure I am following, it would help me to see the exact steps in your code/pipeline/commands that give you the two numbers that should be identical (but aren't, for whatever reason).
Hi @AndreaGuarracino
While preparing the files to share, I realized the issue was in the way I was parsing the matrix. In short, I assumed the output of `odgi paths -H` was a binary matrix. However, it is actually an integer matrix whose values are proportional to the depth of each path on each node. The problem was that I was ignoring values other than 1 (which would be fine for a binary matrix).
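To illustrate the undercount described above, here is a minimal sketch on a toy path. It assumes (this is not confirmed in the thread) that each matrix value is simply the number of times the path visits that node; the names `coverage_row`, `steps`, and `node_ids` are hypothetical and are not part of odgi.

```python
from collections import Counter


def coverage_row(steps, node_ids):
    """Build one row of a toy coverage matrix.

    Assumption: the value for a node is the number of times the path
    steps on it. Nodes the path never visits get 0.
    """
    counts = Counter(steps)
    return [counts[node] for node in node_ids]


# Toy path that visits node 2 twice and never touches node 4.
steps = [1, 2, 3, 2, 5]
node_ids = [1, 2, 3, 4, 5]
row = coverage_row(steps, node_ids)

# Ground truth: number of distinct nodes the path goes through.
unique_nodes = len(set(steps))

# Parsing the row as if it were binary (only counting entries equal to 1)
# silently drops every node that is visited more than once.
ones_only = sum(1 for value in row if value == 1)

# Counting every entry greater than 0 recovers the unique-node count.
nonzero = sum(1 for value in row if value > 0)

print(row, unique_nodes, ones_only, nonzero)
```

In this toy case, counting only the entries equal to 1 misses the node that is visited twice, while counting entries greater than 0 matches the number of unique nodes.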
The only thing that perplexes me is how this problem did not arise before. The only explanation would be that all the previous graphs I tried produced binary matrices (which seems to be the case). Biologically, this is hard to believe, especially because a couple of the previous graphs were subsets of the current graph. This leads me to ask you the following: did `odgi paths -H` produce binary matrices in previous versions of `odgi`?
Hi @sivico26, here is a change that could totally explain your experience! https://github.com/pangenome/odgi/commit/c5b545b33743c87a9165b259093839f9e78b78f9
Before, the output of `odgi paths -H` was a binary matrix by default; now it is a path coverage matrix by default, and optionally a binary matrix.
Oh, that is so reassuring! Thanks @AndreaGuarracino. Now everything squares again.
The only thing I would point out is that the current help of `odgi paths` does not say how to activate the binary matrix output.
Other than that, I think we can consider this issue solved. Thanks again for the fast response and, furthermore, for developing the pangenomics tools ecosystem.
Cheers
Oops, I've checked the code and you're right, it hasn't been put as an option again! I guess you are already easily parsing the haplotype matrix by putting 1 if the value is greater than 0.
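The workaround mentioned here (put 1 wherever the value is greater than 0) could be sketched as below. This is only an illustration under assumptions: it supposes the matrix is tab-separated with a header line and three leading metadata columns (path name, `path.length`, `path.step.count`), which may not match the real output exactly. The function name `binarize_coverage` and the `n_meta` parameter are hypothetical.

```python
import io

# Assumption: three leading metadata columns before the per-node values.
N_META = 3


def binarize_coverage(lines, n_meta=N_META):
    """Yield rows of a coverage matrix with every value > 0 replaced by 1.

    The header line and the first `n_meta` columns of each row are
    passed through unchanged.
    """
    rows = iter(lines)
    yield next(rows).rstrip("\n").split("\t")  # header, untouched
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        meta, coverage = fields[:n_meta], fields[n_meta:]
        yield meta + ["1" if int(value) > 0 else "0" for value in coverage]


# Tiny made-up matrix standing in for a real file.
example = io.StringIO(
    "path.name\tpath.length\tpath.step.count\tnode.1\tnode.2\tnode.3\n"
    "chr1\t10\t4\t2\t0\t1\n"
)
binary_rows = list(binarize_coverage(example))
for row in binary_rows:
    print("\t".join(row))
```

Because the function is a generator over lines, a real file could be passed in place of the `io.StringIO` object and processed one row at a time, which matters for matrices with millions of node columns.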
Glad to have helped! :vulcan_salute:
Hi odgi team,
I have been playing with and getting into the world of pangenomes for an ongoing research project. Part of the way we are trying to analyze our pangenome is by looking at shared nodes among the paths of the graph. We found the binary matrix generated by `odgi paths` particularly useful in this regard.

In a previous discussion (see #444), I thought the elements of the matrix along the path should sum up to the `node.counts` number reported in the output. @subwaystation explained to me that this is not the case, given that a path can go through the same node several times (which would not change the matrix), so you have to take the unique nodes the path goes through. `node.counts` was then changed to `path.step.count`, which is more transparent. I adjusted my code accordingly to remove the duplicates and count only the unique elements, and it worked like a charm.

Now, recently we scaled up and generated our largest pangenome so far (getting very close to what we actually need for our project). This pangenome is composed of 7 plant species of the same genus, and their genome size is ~4 Gb. We built the pangenome using:
`cactus` $\rightarrow$ `hal2vg` $\rightarrow$ `vg construct` $\rightarrow$ `smoothxg` $\rightarrow$ `gfaffix` $\rightarrow$ `odgi build`. The last step was just used to optimize the node numbering.

In any case, for this pangenome, I observe that `odgi paths` is giving inconsistent results, similar to what I thought the problem was in #444, but this time the problem is real. In other words, when I sum the number of unique nodes traversed by a path, it is not equal to the number of ones I see in the matrix produced by `odgi paths` for that path.

To give you an idea, here are some numbers:
So the difference amounts to several million nodes for each path (if I go to base space, each path is missing ~20 Mb). Curiously enough, the first lines of the matrix (`path.length` and `path.step.count`) are correct: the lengths correspond to the chr sizes, and when I calculate the `step.counts` myself, I get the same numbers.

Finally, I would also like to point out that this did not happen for any of my smaller graphs. For all the previous ones I have tried, the counts from the matrix and from the paths matched perfectly (as they should). However, all of them were smaller too, so this phenomenon is definitely specific to this pangenome or to this scale.
Do you have any idea what may be causing it? Do you think there is something flawed with the pangenome that can explain it? What could that be? On the other hand, can you think of anything that could go wrong with the way `odgi paths` is filling the matrix at this scale? It is clearly traversing the paths just fine but seems to be forgetting to mark some nodes as True.

Your thoughts are very much appreciated.
Cheers, Sivico
P.S.: I am using `odgi` version v0.8.2-0-g8715c55