Distance measure used by phom.dist()

rrrlw / TDAstats

R pipeline for computing persistent homology in topological data analysis. See https://doi.org/10.21105/joss.00860 for more details.

https://rrrlw.github.io/TDAstats

GNU General Public License v3.0


Distance measure used by phom.dist() #26

Closed. sarahsamorodnitsky closed this issue 3 months ago

sarahsamorodnitsky commented 4 months ago

Hello! I am using the phom.dist() function to compute the distance between persistence diagrams. Can you clarify what distance measure is being computed by this function? Is there a reference/citation/source for the distance measure being computed? I was under the impression phom.dist() returned the Wasserstein distance based on the function naming, but looking at a previous issue (https://github.com/rrrlw/TDAstats/issues/13) I see that that isn't the case.

Thanks!

corybrunson commented 4 months ago

Digging into the code in 'inference.R', it looks like the distance is a vector with one entry per dimension. Each entry is the sum of the absolute differences, each raised to the power $q$, between the sorted feature lifespans (rather than birth–death coordinates) within that dimension:

$$D_q(X,Y)[d] = \sum_{k=1}^{n_d} \lvert \ell(x_k) - \ell(y_k) \rvert ^ q$$

where $d$ ranges over dimensions, $n_d$ is the larger of the numbers of $d$-dimensional features in $X$ and $Y$, $\ell(x)$ is the lifespan (death minus birth) of feature $x$, and the features $x_k$ and $y_k$ are in descending order of lifespan.
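To make the formula concrete, here is a minimal Python sketch of the computation as I read it. It is not the package's R code. The function name `lifespan_distance` and the `(dimension, birth, death)` tuple layout are made up for illustration, and padding the shorter lifespan list with zeros so that both have length $n_d$ is my assumption about how unmatched features are handled.

```python
from collections import defaultdict


def lifespan_distance(diag_x, diag_y, q=1.0):
    """Sketch of D_q(X, Y)[d]; returns a dict mapping dimension -> value.

    Each diagram is a list of (dimension, birth, death) tuples
    (a hypothetical layout chosen for this example).
    """

    def lifespans_by_dim(diagram):
        # Group feature lifespans (death - birth) by homological dimension.
        by_dim = defaultdict(list)
        for dim, birth, death in diagram:
            by_dim[dim].append(death - birth)
        return by_dim

    lx = lifespans_by_dim(diag_x)
    ly = lifespans_by_dim(diag_y)

    result = {}
    for dim in sorted(set(lx) | set(ly)):
        # Sort lifespans in descending order, as in the formula.
        a = sorted(lx.get(dim, []), reverse=True)
        b = sorted(ly.get(dim, []), reverse=True)
        # ASSUMPTION: pad the shorter list with zero lifespans up to n_d.
        n_d = max(len(a), len(b))
        a += [0.0] * (n_d - len(a))
        b += [0.0] * (n_d - len(b))
        # Sum of absolute lifespan differences, each raised to the power q.
        result[dim] = sum(abs(u - v) ** q for u, v in zip(a, b))
    return result


# Two small made-up diagrams with features in dimensions 0 and 1.
X = [(0, 0.0, 1.0), (0, 0.0, 0.5), (1, 0.2, 0.9)]
Y = [(0, 0.0, 0.8), (1, 0.1, 0.4), (1, 0.3, 0.4)]
print(lifespan_distance(X, Y, q=1))
print(lifespan_distance(X, Y, q=2))
```

Note that only lifespans enter the computation, so two features with the same lifespan but different birth times contribute zero to the sum.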

@rrrlw may want to chime in. I'm not sure when the package might be upgraded, but certainly clarifying this, and hopefully providing the Wasserstein distance, will be part of that.

sarahsamorodnitsky commented 4 months ago

Any intuition as to why this distance is recommended/used over the p-Wasserstein distance to compare persistence diagrams? I haven't seen this distance measure in the literature, though my literature search has not been exhaustive.

Thanks again!

corybrunson commented 3 months ago

I didn't contribute to it and I don't know a reference for it. My intuition is that it's much less complicated and computationally expensive than the Wasserstein distance, though it would certainly be good to provide an explicit rationale.