rcppmlpack / RcppMLPACK1

60 stars 14 forks source link

EMST DualTreeBoruvka no option for leaf size #15

Open peekxc opened 7 years ago

peekxc commented 7 years ago

From the documentation:

/**
   * Create the tree from the given dataset.  This copies the dataset to an
   * internal copy, because tree-building modifies the dataset.
   *
   * @param data Dataset to build a tree for.
   * @param naive Whether the computation should be done in O(n^2) naive mode.
   * @param leafSize The leaf size to be used during tree construction.
   */
  DualTreeBoruvka(const typename TreeType::Mat& dataset,
                  const bool naive = false,
                  const MetricType metric = MetricType());

The constructor does not allow leafSize as a parameter, even though the documentation says otherwise

rcurtin commented 7 years ago

Fixed upstream in mlpack/mlpack@86689aa, sorry for the misleading documentation. I don't know what the right way to do this from RcppMLPACK would be, but in C++, if you wanted to set the leaf size during tree construction you would need to build the tree by hand. I can provide more details if you like.

peekxc commented 7 years ago

I'm calling MLPACK from R via Rcpp, so, it is technically C++ but I assume you're talking about the traditional C++ API?

I assume it's possible to change the template parameter from the default typename TreeType = tree::BinarySpaceTree<bound::HRectBound<2>, DTBStat>

To a different type of tree, with its settings set there? If so, I think I can figure it out, you guys have pretty good docs on the kNN trees, thank you though :).

I do have a related question that I'm not sure is appropriate in this thread. I'm investigating to see if leafSize affects the MST results when the input data set has several non-unique edges.

Particularly, I have a test data set that, when I compute the MST on through Python's SciKit-learn's DualTree Boruvka code, I get a set of edge weights that add up to 33.50638... etc. Using the optree package in R, both Prim's and Boruvka's (non-dual tree) implementation return 33.50638 as well.

However, using MLPack's super fast EMST implementation with its default settings returns an MST-looking result, but the edge weights always return 34.11344. Is it possible this is a bug?

rcurtin commented 7 years ago

(This might be a bit verbose, but hey, more information is more fun, so, here goes...)

The BinarySpaceTree is what I would use for the MST calculation in low(ish) numbers of dimensions, so I'd use the same type. The leaf size is the number of points held in each node, and it's a runtime parameter (not a type parameter) to the tree. So I would write it like this:

typedef tree::BinarySpaceTree<bound::HRectBound<2>, DTBStat> TreeType;
std::vector<size_t> oldFromNew; // the tree constructor rearranges points---this gives a mapping
size_t leafSize = 20;
TreeType myTree(std::move(myData), oldFromNew, leafSize); // avoid copy with move constructor
DualTreeBoruvka<> d(&myTree);

...

About leafSize... it should not affect the results at all, only the runtime. The dual-tree recursion is done in such a way that it recurses to the leaves before performing the brute-force base case computation, which for DTB if I remember right is where the next closest pairs to be joined in the MST are found. So the leafSize only controls how deep this recursion will go vs. how many brute-force base case computations are done at each iteration. So the leaf size should not affect the way in which the minimum spanning tree is constructed.

For the edge weights issue, I'm happy to look into it if you can post a minimum working example (probably best done as a new issue on the mlpack project), and a way to reproduce the results. It may be a bug, but I'd actually suspect that more likely the values that are being output by the other libraries are subtly different quantities. The DTB code is fairly thoroughly tested, but it's never possible to say there are no bugs in something. :)

peekxc commented 7 years ago

@rcurtin Thank you for your replies! And for the code/detailed explanation, I was not completely aware as to the exact specifics of how to set the leaf size.

Regarding the edge weights issue, I do have a reproducible example. I will head there and post shortly once I've collected everything together.

I'm not sure if I'm supposed to close this or you, but I'd consider my question closed :)

rcurtin commented 7 years ago

I don't think I should close this because no changes have been made to the documentation in RcppMLPACK. I only fixed it upstream in mlpack.

eddelbuettel commented 7 years ago

@peekxc I would welcome a PR with a new example / accessor in the RcppMLPACK2 repo -- it is based on the (current) MLPACK 2.1.0 -- and we're trying to work out how to possibly get it synced in over here (which works with an older, embedded version of MLPACK).