rrrlw / TDAstats

R pipeline for computing persistent homology in topological data analysis. See https://doi.org/10.21105/joss.00860 for more details.
https://rrrlw.github.io/TDAstats
GNU General Public License v3.0
37 stars 9 forks source link

Dist changes #19

Closed peekxc closed 4 years ago

peekxc commented 4 years ago

TDAstats supports as input to calculate_homology a point cloud or distance matrix, indicated by the given format parameter. This PR modifies it to support dist objects natively, both at the interface level and at the C++ level.

Currently, if a standard dist object is passed in and distmat is not specified, then calculate_homology assumes it's a 1-dimensional point cloud, yielding different barcodes than as if one were to pass the corresponding distance matrix.

More ever, if the distmat option is specified, the dist object is converted to a distance matrix via as.matrix, only to then again be converted again in C++ back to its lower triangular form; as.matrix applied to dist objects copies every distance 2x, increasing the amount of memory needed to pass in a dist object by 2. To fix this, a new function ripser_cpp_dist is added to the source that accepts dist vectors natively, and minimizes unnecessary copying by std::moves them into compressed_upper_distance_matrix's (since R is column-major and Ripser uses row-major indexing, the compressed_upper_distance_matrix in Ripser is equivalent to lower triangular portion of the distance matrix in R, and vice versa.

Here's the reprex showing the new interface:

library(TDAstats)

# create a 2-d point cloud of a circle (100 points)
num.pts <- 100
rand.angle <- runif(num.pts, 0, 2*pi)
pt.cloud <- cbind(cos(rand.angle), sin(rand.angle))

pers.hom1 <- calculate_homology(pt.cloud)
pers.hom2 <- calculate_homology(dist(pt.cloud)) ## this one used to differ from the rest
pers.hom3 <- calculate_homology(dist(pt.cloud), format = "distmat")
pers.hom4 <- calculate_homology(as.matrix(dist(pt.cloud)), format = "distmat")

all.equal(pers.hom1, pers.hom2)
#> [1] TRUE
all.equal(pers.hom1, pers.hom3)
#> [1] TRUE
all.equal(pers.hom3, pers.hom4)
#> [1] TRUE

Created on 2020-06-23 by the reprex package (v0.3.0)

The only thing that's important is pers.hom2. It used to give different results than the rest.

codecov-commenter commented 4 years ago

Codecov Report

Merging #19 into master will decrease coverage by 1.08%. The diff coverage is 76.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
- Coverage   93.37%   92.29%   -1.09%     
==========================================
  Files           6        6              
  Lines         589      584       -5     
==========================================
- Hits          550      539      -11     
- Misses         39       45       +6     
Impacted Files Coverage Δ
src/ripser_short.cpp 92.03% <73.91%> (-0.42%) :arrow_down:
R/calculate.R 87.50% <79.16%> (-9.56%) :arrow_down:
R/inference.R 93.67% <0.00%> (-0.88%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0700bff...d2e2337. Read the comment docs.

rrrlw commented 4 years ago

Wow @peekxc this is fantastic!! Both the support for the new p parameter (and supporting calculations) and supporting dist objects (and the associated tests) are extremely helpful. Sorry it's taken me so long to get to this.

One of the things I've been meaning to do for a while is split TDAstats up - ggtda already deals w/ visualization and tools for statistical inference are being added w/ Shael Brown's help, so my thought was to separate the calculation aspect of TDAstats into another R package (ripserr). In this manner, I could also add cubical ripser (maybe also adding calculation of extended persistence) to ripserr w/ a workstream independent of the rest of TDAstats (and independent releases to CRAN as well). In general, it also seems easier to have compartmentalized packages w/ one major function each (in this case, TDAstats = inference for PH, ggtda = viz for PH, and ripserr = calculation of PH).

With this in mind, in addition to merging this PR into TDAstats (will do shortly after a few confirmatory tests; although functionality will eventually be deprecated in favor of ripserr), I have also ported your code over to rrrlw/ripserr (branch = fromtdastats) in the latest commit (still porting tests over). Is this okay with you? I will be giving you push access to the ripserr repo shortly - feel free to correct any of my mistakes there. Added to you ripserr DESCRIPTION as an author, feel free to correct any misspellings, etc. I made.

Thanks again for this contribution! Looking forward to hearing what you think.

peekxc commented 4 years ago

Hey @rrrlw, just to double-check: goal of ripserr is just to provide engine for PH effectively, and move TDAstats towards more modular sort of interface that may in the future accept options for multiple engines doing persistent homology, correct?

rrrlw commented 4 years ago

Goal of ripserr is as you stated (Ripser + Cubical Ripser + maybe other engines e.g. Flagser). I definitely want to make TDAstats more modular/single purpose as it was getting tricky to maintain in a single package, but goal is to have it contain statistical/inference stuff for TDA (which could be persistent homology output from many engines, maybe even stats for Mapper stuff although I am very much a newbie to that). @corybrunson and I were thinking of another package (tentatively named perrrsist) that would serve as an interface to persistent homology (and other?) engines (analogous to parsnip in tidymodels).

P.S. - although I merged the fromtdastats branch into master in rrrlw/ripserr, if there's a better way of incorporating your code, I'm more than happy to reverse/edit!