peekxc / Mapper

R package for computing the Mapper construction from Topological Data Analysis
http://peekxc.github.io/Mapper/

Bare Bones THD #10

Open mpcasey2718 opened 4 years ago

mpcasey2718 commented 4 years ago

Hi Matt,

I do not know if this is the right place to post this request, or if these should be broken up into separate issues, but a bare bones THD implementation would be much appreciated. Is it currently possible to specify a desired resolution for a cover? Is this done by the maxmin procedure via specifying the number of landmarks?

From what I understand from the documentation (https://peekxc.github.io/Mapper/reference/use_filter.html), we can use our own filters if desired, but is it assumed the filter function returns one value for each row of the dataset, or two?

If things need to be further specified, let me know. -Michael

P.S. I sent some comments on the shape recognition post to Slack; if you want me to send them to a different place, let me know.

P.P.S. I apologize for the strange formatting.

peekxc commented 4 years ago

Hi @mpcasey2718, thanks for making an issue. A couple things:

but a bare bones THD implementation would be much appreciated.

I'll look into putting a minimal version together. I'm agnostic to the choice of how the hierarchy is represented; I'm not sure whether I should make my own or use something like data.tree to build it programmatically. I'm secretly an old statistician at heart and prefer more traditional hierarchical formats like the one from hclust.

With a little work, a barebones version should be doable. Since a THD is just a recursive application of Mapper, all the functionality needed to make one (e.g. getting the connected components) should exist in the package as-is. Here's a very minimal, illustrative, not well-developed example of how one might go about doing this:

library(Mapper)

## Load world values survey example data; see ?wvs_us_wave6
data("wvs_us_wave6", package = "Mapper")
dataset <- scale(as.matrix(wvs_us_wave6))

## Function to perform mapper on arbitrary subset of data
do_mapper <- function(X){
  m <- MapperRef$new()
  m$use_data(X)
  m$use_filter("PC", rank. = 2L)
  m$use_cover("fixed interval", number_intervals=10L, percent_overlap=20)
  m$construct_k_skeleton(k=1L) ## 1-skeleton
  return(m)
}

library(data.tree)

## Start with a root node
thd <- data.tree::Node$new("thd_root")
thd$mapper <- do_mapper(dataset) ## make the mapper on the whole data set
thd$subset <- seq(nrow(thd$mapper$data())) ## point indices of subset mapper was created on

## Create the first level 
v_ids <- thd$mapper$simplicial_complex$vertices
cc <- thd$mapper$simplicial_complex$connected_components
for (cc_id in unique(cc)){
  data_subset <- unlist(thd$mapper$vertices[as.character(v_ids[cc == cc_id])])

  ## Only create children for subsets larger than 100 points
  if (length(data_subset) > 100){
    child_node <- thd$AddChild(name = sprintf("1-%d", cc_id))
    child_node$mapper <- do_mapper(dataset[data_subset,,drop=FALSE])
    child_node$subset <- unname(data_subset)
  }
}

Created on 2019-08-17 by the reprex package (v0.3.0)

This creates a single level of a THD from the root node using the subsets of the data given by the connected components. Obviously, this solution needs a lot of work, e.g. a recursive building process along with some parameters like the 'network size' threshold to break the recursion, plus nice plotting functions, etc.
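Here's a hedged sketch of how that recursive process might look, reusing the do_mapper() helper from above. The build_thd() function, its min_size threshold, and the depth-based node naming are all illustrative choices, not part of the package. One subtlety: the point indices stored in each mapper's vertices are relative to the subset that mapper was built on, so they need to be mapped back through the node's subset field to index the full data set.

## Hypothetical recursive THD builder (illustrative only)
build_thd <- function(node, X, min_size = 100L, depth = 1L){
  v_ids <- node$mapper$simplicial_complex$vertices
  cc <- node$mapper$simplicial_complex$connected_components
  for (cc_id in unique(cc)){
    ## Point indices local to the data this node's mapper was built on;
    ## unique() because a point can fall in several overlapping vertices
    local_idx <- unique(unlist(node$mapper$vertices[as.character(v_ids[cc == cc_id])]))
    data_subset <- node$subset[local_idx] ## map back to indices in the full data set

    ## The 'network size' threshold breaks the recursion
    if (length(data_subset) > min_size){
      child <- node$AddChild(name = sprintf("%d-%d", depth, cc_id))
      child$mapper <- do_mapper(X[data_subset,,drop=FALSE])
      child$subset <- unname(data_subset)
      build_thd(child, X, min_size, depth + 1L)
    }
  }
  invisible(node)
}

## Usage: same root construction as above, then recurse
thd <- data.tree::Node$new("thd_root")
thd$mapper <- do_mapper(dataset)
thd$subset <- seq_len(nrow(dataset))
build_thd(thd, dataset)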

Is it currently possible to specify a desired resolution for a cover?

I've always interpreted the term 'resolution' to signify the coarseness of the open sets that make up the cover in the filter space. To this end, the number_intervals argument for the fixed interval type of cover conceptually controls the 'resolution' (and percent_overlap the 'gain').
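For instance (an illustrative snippet, assuming a MapperRef instance m set up as in the examples elsewhere in this thread):

## Coarser vs. finer cover over the same filter: same gain, different resolution
m$use_cover("fixed interval", number_intervals = 5L, percent_overlap = 20)
m$use_cover("fixed interval", number_intervals = 25L, percent_overlap = 20)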

Is this done by the maxmin procedure via specifying the number of landmarks?

See this issue. A maxmin landmark-based ball cover is coming. Right now, only the interval-type covers are well-supported, and in these covers the placement of the intervals is outside the user's control; they're spaced equally along the range of the map.
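For reference, the maxmin procedure itself is just greedy farthest-point sampling. Here's a hedged, plain-R sketch of it; the maxmin_landmarks() helper, the euclidean metric, and the seed choice are illustrative, not the package's planned API:

## Greedy maxmin landmark selection (illustrative only)
maxmin_landmarks <- function(X, n_landmarks, seed_idx = 1L){
  landmarks <- integer(n_landmarks)
  landmarks[1] <- seed_idx
  ## Distance from every point to its nearest landmark chosen so far
  d_min <- sqrt(rowSums(sweep(X, 2L, X[seed_idx, ])^2))
  for (i in seq_len(n_landmarks - 1L)){
    landmarks[i + 1L] <- which.max(d_min) ## farthest point becomes the next landmark
    d_new <- sqrt(rowSums(sweep(X, 2L, X[landmarks[i + 1L], ])^2))
    d_min <- pmin(d_min, d_new)
  }
  landmarks
}

In a landmark-based ball cover, the number of landmarks would then play the role of the resolution: more landmarks means more (and smaller) balls covering the data.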

I have a (slightly out-of-date) tutorial on making your own cover and am totally open to PRs for other cool covering methods.

From what I understand from the documentation (https://peekxc.github.io/Mapper/reference/use_filter.html), we can use our own filters if desired, but is it assumed the filter function returns one value for each row of the dataset, or two?

This package works with lens spaces of any dimension. Some caution is needed in using multivariate maps with e.g. interval-type covers, due to how quickly things blow up. For example:

library(Mapper)

## Load world values survey example data; see ?wvs_us_wave6
data("wvs_us_wave6", package = "Mapper")
dataset <- scale(as.matrix(wvs_us_wave6))

m <- MapperRef$new()
m$use_data(dataset)
m$use_filter("PC", rank. = 1L)
m$use_cover("fixed interval", number_intervals=5L, percent_overlap=30)
m$construct_k_skeleton(k=1L) ## 1-skeleton

print(m$simplicial_complex)
#> Simplex Tree with (6, 5) (0, 1)-simplices

## 2-D filter. Note open cover now has 25 open sets! 
m$use_filter("PC", rank. = 2L)
m$use_cover("fixed interval", number_intervals=5L, percent_overlap=30)
m$construct_k_skeleton(k=1L) ## 1-skeleton

print(m$simplicial_complex)
#> Simplex Tree with (22, 43) (0, 1)-simplices

## Same as above, but more explicit
m$use_filter("PC", rank. = 2L)
m$use_cover("fixed interval", number_intervals=c(5L, 5L), percent_overlap=c(30, 30))
m$construct_k_skeleton(k=1L) ## 1-skeleton

print(m$simplicial_complex)
#> Simplex Tree with (22, 43) (0, 1)-simplices

Created on 2019-08-17 by the reprex package (v0.3.0)

The first mapper constructs the 1-skeleton from a cover of 5 open sets equally distributed along the first principal component.

The second and third mappers have a 2D filter space. In the second mapper, though, only a single number was provided specifying the number of intervals (where a vector of length 2 was expected), so it's recycled across both dimensions; the same goes for the overlap.

Some care is needed when working with multivariate maps, as the number of non-empty intersection checks grows combinatorially with the number of open sets in the cover (e.g. 5 intervals per dimension in a 2D filter space already yields 5^2 = 25 open sets).

This package follows the convention that rows index points and columns index dimensions. So to specify a multivariate lens space, the preferred way is to just precompute the filter as an (n x d) matrix, where d is the dimensionality of the filter space.

Example:

library(Mapper)

## Load world values survey example data; see ?wvs_us_wave6
data("wvs_us_wave6", package = "Mapper")
dataset <- scale(as.matrix(wvs_us_wave6))

m <- MapperRef$new()
m$use_data(dataset)

## Precompute principal components
pc_dataset <- prcomp(dataset)$x
print(dim(pc_dataset))
#> [1] 2232   28

m$use_filter(pc_dataset[,1]) ## use 1D map
m$use_filter(pc_dataset[,1:2]) ## use 2D map
## .. etc.

Created on 2019-08-17 by the reprex package (v0.3.0)

P.S. I sent some comments on the shape recognition post to Slack; if you want me to send them to a different place, let me know.

Sounds good, I'll check now!

P.P.S. I apologize for the strange formatting.

Np!

mpcasey2718 commented 4 years ago

Thank you. I believe I can parse most of the code above, but is there a recommended place to start learning R and/or Rcpp? I can then start testing things.

The comments at the end of the data.tree page suggest that it might be OK for small datasets, but the performance may not be so great for larger ones. I don't know if hclust is better in this regard. Do you have a sense of what dataset sizes you are writing the code for?

In terms of precomputing the filter functions, is it faster to do so in Rcpp, or does it not make much difference?

peekxc commented 4 years ago

Regarding learning R, Hadley's book (https://adv-r.hadley.nz/) is a good resource if you've learned other languages before and you just want to blaze through the essentials.

Regarding Rcpp, there are introductions online (http://dirk.eddelbuettel.com/code/rcpp/Rcpp-introduction.pdf), but if you know C++, it's pretty straightforward; if you don't know C++, it might be quite difficult. Rcpp is essentially a really well-developed framework that uses (really advanced) template type deduction to provide a set of STL-like containers that wrap native R-allocated types in C++, and to simplify getting C++ code to run compared to using e.g. R's C API (http://adv-r.had.co.nz/C-interface.html) directly. Again, I recommend only picking it up if you're comfortable with C++ to begin with.
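To make that concrete, here's a tiny hedged example (it assumes the Rcpp package is installed; sum_squares is just an illustrative function, not anything from this package):

library(Rcpp)

## NumericVector is an STL-like container wrapping an R-allocated numeric vector
cppFunction('
double sum_squares(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
  return total;
}')

sum_squares(c(1, 2, 3))
#> [1] 14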

The comments at the end of the data.tree page suggest that it might be OK for small datasets, but the performance may not be so great for larger ones. I don't know if hclust is better in this regard. Do you have a sense of what dataset sizes you are writing the code for?

data.tree uses pure R, and as such is designed to be user-friendly. Things made in pure R are not meant to be fast (http://adv-r.had.co.nz/Performance.html#why-is-r-slow). If you want speed, you either rely on libraries that call C/C++ (or FORTRAN) code natively, or you write it yourself in C++ and export it as a package.

That being said, I think most THDs I've seen really aren't very large: usually fewer than 100 nodes, and almost always fewer than 1000, right? Following the notion that premature optimization is the root of all evil (https://stackify.com/premature-optimization-evil/), I recommend getting anything working first, and then inspecting what the bottlenecks are. I expect computing the filters (e.g. tSNE...) hundreds of times will be the biggest bottleneck, not forming the tree.

mpcasey2718 commented 4 years ago

OK, I'll take a look. I had encountered the STL before when I was learning C++ five years ago, but have not worked with it much in the meantime. I shall pay attention to how long computing the filters takes.
