vertex-level operations

corybrunson commented 4 years ago

Hi again! I wanted to flag some behavior of the vertex-level operations degree() and adjacent() i think is counterintuitive and would suggest changing slightly—though i haven't carefully checked their possible downstream effects. The reprex below illustrates, and i re-installed simplextree from GitHub. I use igraph as a comparison in both cases because the package is widely-used and it's UI has had a while to mature.

The degree() operation works the same way as an R6 method and as a wrapper, and in practice it is similar to the function of the same name in igraph. Two key behaviors are different, though:

When no vertices argument is supplied, simplextree::degree(s) throws an error; i think the preferred behavior in this case is igraph's, in which igraph::degree(g) returns the degree sequence of the entire graph.
When some vertices are not valid vertex IDs, the simplextree function returns a vector with 0s in those positions, possibly misleading the user. Again i think igraph's behavior is preferred: an error is thrown when any vertex IDs are invalid. (Though i'm most persuadable on this point.)

The adjacent() operation actually works differently as an R6 method and as a wrapper.

It makes sense to me for st$adjacent(vertices) and adjacent(st, vertices) to have the same behavior, in particular to ease code translation between R6 method-chaining and base R function-nesting (or pipe-based method-chaining). Between the two, i prefer the more flexible wrapper behavior, which can handle one or multiple vertices.
Similarly to degree(), i think the operation should fail if any vertex IDs are invalid rather than return numeric(0) or a list with numeric(0) entries.
In case length(vertices) == 1, i suggest that the operation still return a (length-1) list rather than a vector. This would make scripting much easier, in that the structure of the returned value would not depend on the length of the input vector.

If we can agree to an appropriate set of behaviors, then i'll be glad to make the minor changes and submit a PR!

library(igraph)
#> Warning: package 'igraph' was built under R version 4.0.2
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
library(simplextree)
#> 
#> Attaching package: 'simplextree'
#> The following object is masked from 'package:igraph':
#> 
#>     contract
#> The following object is masked from 'package:utils':
#> 
#>     find
#> The following objects are masked from 'package:base':
#> 
#>     remove, serialize
# igraph degree behavior
g <- graph(c(1,2, 2,3, 1,3, 3,4))
degree(g)
#> [1] 2 2 3 1
degree(g, 0:5)
#> Error in degree(g, 0:5): At iterators.c:759 : Cannot create iterator, invalid vertex id, Invalid vertex id
# simplextree degree behavior
s <- simplex_tree(list(1:3, 3:4))
s$degree()
#> Error in s$degree(): could not find valid method
s$degree(0:5)
#> [1] 0 2 2 3 1 0
# igraph adjacent behavior
g <- graph(c(1,2, 2,3, 1,3, 3,4))
adjacent_vertices(g, 1)
#> [[1]]
#> + 2/4 vertices, from 12101ee:
#> [1] 2 3
adjacent_vertices(g, 0:5)
#> Error in adjacent_vertices(g, 0:5): At iterators.c:759 : Cannot create iterator, invalid vertex id, Invalid vertex id
# simplextree adjacent behavior
s$adjacent(1)
#> [1] 2 3
adjacent(s, 1)
#> [1] 2 3
s$adjacent(0:5)
#> Error in s$adjacent(0:5): Expecting a single value: [extent=6].
adjacent(s, 0:5)
#> [[1]]
#> numeric(0)
#> 
#> [[2]]
#> [1] 2 3
#> 
#> [[3]]
#> [1] 1 3
#> 
#> [[4]]
#> [1] 1 2 4
#> 
#> [[5]]
#> [1] 3
#> 
#> [[6]]
#> numeric(0)

^{Created on 2020-10-24 by the reprex package (v0.3.0)}

peekxc commented 4 years ago

Thanks for the ideas as always. I agree with the most of the sentiments. There's actually a couple issues at play here.

In particular, I agree with the following sentiments:

When no vertices argument is supplied... [simplextree::degree(s) should] return the degree sequence of the entire graph.

Agreed. I've changed this behavior for the package-level function simplextree::degree. I also agree with making the module-namespace functions as similar in interface to the package level functions, however sometimes this may not be possible without a lot of added complexity at the C++ level.

suppressMessages(library(simplextree))
st <- simplex_tree(list(1:3, 4:5, 6))
all.equal(st$degree(st$vertices), st %>% degree())
#> [1] TRUE

^{Created on 2020-10-24 by the reprex package (v0.3.0)}

I haven't figured a way to do this at the C++ level.

Why C++ is sometimes hard

I would like to do the same for at the cpp module level, however nothing I've tried works. I always get ```R > st$degree() Error in st$degree() : could not find valid method ``` Things I tried: 1. Default arguments ```R IntegerVector degree_R(SimplexTree* st, vector< idx_t > ids = vector< idx_t >() ){ ... return res; } ... RCPP_MODULE(simplex_tree_module) { Rcpp::class_("SimplexTree") ... .method( "degree", °ree_R) ... }) ``` 2. Using [Nullable parameters](https://gallery.rcpp.org/articles/optional-null-function-arguments/) didn't work. ```R IntegerVector degree_R(SimplexTree* st, Nullable< IntegerVector > ids_ = R_NilValue){ IntegerVector ids = ids_.isUsable() ? IntegerVector(ids_) : (IntegerVector) wrap(get_vertices(st)); ... return res; } ... ``` 3. Surprisingly, overloading didn't either, ```R IntegerVector degree_R(SimplexTree* st, Nullable< IntegerVector > ids_ = R_NilValue){ IntegerVector ids = ids_.isUsable() ? IntegerVector(ids_) : (IntegerVector) wrap(get_vertices(st)); ... return res; } ... ``` Although [it seems possible to overload member functions](http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2010-November/001326.html), overloading with free functions looks like it might not possible with modules actually. From section 7.2.1 [here](https://doc.lagout.org/programmation/Multi-Language/Seamless%20R%20and%20C%20%20%20Integration%20with%20Rcpp%20%5BEddelbuettel%202013-06-04%5D.pdf): > The function name itself has to be unique in the module. In other words, no two functions with the same name but different signatures are allowed. While C++ allows overloading functions, Rcpp modules relies on named identifiers for the lookup and cannot allow two identical identifiers.

It makes sense to me for st$adjacent(vertices) and adjacent(st, vertices) to have the same behavior, in particular to ease code translation between R6 method-chaining and base R function-nesting (or pipe-based method-chaining). Between the two, i prefer the more flexible wrapper behavior, which can handle one or multiple vertices...

In case length(vertices) == 1, i suggest that the operation still return a (length-1) list rather than a vector. This would make scripting much easier, in that the structure of the returned value would not depend on the length of the input vector.

Agreed and agreed. The package level adjacent now returns the adjacencies for every vertex is not specified, and the special case where length(vertices) == 1 has been corrected to return a list.

suppressMessages(library(simplextree))
st <- simplex_tree(list(1:3, 4:5, 6))
all.equal(st$adjacent(1), st %>% adjacent(1))
#> [1] TRUE
all.equal(st$adjacent(st$vertices), st %>% adjacent())
#> [1] TRUE

^{Created on 2020-10-24 by the reprex package (v0.3.0)}

When some vertices are not valid vertex IDs, the simplextree function returns a vector with 0s in those positions, possibly misleading the user. Again i think igraph's behavior is preferred... Similarly to degree(), i think the operation should fail if any vertex IDs are invalid rather than return numeric(0) or a list with numeric(0) entries.

This issue is actually multifaceted I think, but in summary that aspect of the design was intentional. A small factor was that non-throwing functions can be optimized on the C++ level--but the largest factor for choosing to not throw on inputs outside the complex was actually just to maintain consistency with the rest of the package interface.

As a small digression, over the years as C++ has evolved, one of the things that came to light was that exceptions are not always great things. Fun fact: some companies even banned the use of exception handling all together. This primary reason for this is easy to illustrate.

Example nightmare scenario

Consider you a `Container` class that has a member function `AcceptModRequests` that accepts a list request objects with some property `mod`, and appends them to the container. If a particular request doesn't have the `mod` property, an exception is thrown. So what happens if you pass a list of 99,999 valid request objects, but the last one is doesn't have the `mod` property? The first 99,999 are added, one by one to the container... but then when the invalid one is encountered, an exception is thrown. The stack must unwind, and the Container instance ought to be restored to a valid state. But to do that requires removing all the items you just added. You can imagine nightmare scenarios where things like this bring entire servers to their knees.

In simplextree, part of the original design was to make functions which modify the complex idempotent, where possible, and to really simplify building complexes by hand to make interactive use feasible. For example, compare this code

st <- simplextree()
st %>% insert(1:3)

with this code

st <- simplextree()
st %>% insert(list(1, 2, 3, c(1,2), c(2,3), c(1,3), c(1,2,3)))

They both do the same thing, but there are actually a number of TDA packages I've use that require the latter pattern that faces be inserted prior to their corresponding cofaces, otherwise an exception is thrown. And if that pattern was followed here, then the first code example should throw.

Ok thats just the property of being a simplicial complex though, one might argue. But the same idea applies to remove as well. Should this function throw?

st <- simplex_tree(list(1:3, 4:5, 6))
st %>% remove(c(20, 21))

Clearly the edge being removed isn't in the complex. Suppose that, instead of returning invisibly, an exception was thrown because the operation is nonsensical. Now if a user wants to scaffold their own code using remove and they want to protect against against exceptions, they either need to write their code very carefully such that they never accidentally remove a simplex not in the complex (e.g. st %>% remove(1:2); st %>% remove(1) will already fail), or could write a wrapper function:

safe_remove <- function(complex, simplices){
   simplices_that_exist <- complex %>% find(simplices)
   complex %>% remove(simplices[simplices_that_exist])
}

But what's the complexity of this function? You have to traverse the trie twice, so its twice the number of operations than the equivalent code which just removes the simplex if it exists. On top of that, you run into the possible catastrophe given in the nightmare scenario above.

So the counter argument to that is that well let's just add exception checking on function that actually return results, not on functions that modify the data structure. And in some cases, throwing exceptions is extremely useful for e.g. input checking when there really is no alternative to return. A good example of this is the nat_to_sub bijection, which is only defined for a fixed domain + codomain, and has undefined behavior otherwise. degree and adjacent fall under this category, as does find, and when considering those functions in isolation, independent from the rest of the package, one could make a strong argument that the inputs should be checked and exceptions should be thrown for "invalid" ones. But to keep things consistent in the package, and to keep things streamlined for people who are aware of the the fact that non-existent simplices have 0/empty-list degrees/adjacencies, I'm not convinced its really cleaner to use exceptions here.

Besides, this is somewhat common to do in functional programming. A trivial example would be something like:

sum(st %>%> degree(seq(100)))
## vs. 
sum(st %>%> degree(st$vertices[st$vertices < 100]))

The first takes advantage of the default behavior to simplify the code, whereas the latter is more complex but needed nonetheless if exceptions were thrown for invalid vertices. R is a somewhat functional language. How does R handle this?

do.call(rbind, list(NULL, c(1, 2), c(3, 4), NULL))

As a last note, this sentiment as a whole follows lessons learned in the C++ community involving non-existent keys in mapping data structures. If you have a mapping data structure and you want the value associated with a key, but the keys doesn't exist, many have concluded that it's better to return something that is implicitly casteable to false which may or may not contain a value. The reason for this is that it lets the user assume default functionality that he/she can then take advantage of at times, e.g. knowing that non-existent keys yield results castable to false let one compose transform functions that are valid for any input. This is one of the idea of the std::optional structure in C++. For a fun example, see cropping cat photos.

corybrunson commented 4 years ago

Though i don't always say it, i always appreciate your thorough and insightful explanations!

I'm swayed on the preferability of recognizable returns over error throws in the C++ context. Importantly for conceptual consistency, i think, both the 0s in the vector returned by degree(st, v) and the numeric(0)s in the list returned by adjacent(st, v) have reasonable interpretations as the numbers of edges having v[[i]] as a face and the sets of vertices incident to an edge also incident to v[[i]].

One option might be to have a parameter like safe that defaults to FALSE but, if TRUE, performs the additional traversal. Though i actually think it makes more sense to urge the user (in these methods' documentations) to include a find() step when a script is desired to throw errors like those of igraph, since then the code more clearly communicates what it does.

I'll leave the issue open as a flag for the module-namespace $degree() method, but feel free to close it if you feel that it can be put to rest for now. Thank you for the upgrades!

peekxc / simplextree

vertex-level operations #10