rcppmlpack / RcppMLPACK1


Setting seed for kmeans #23

Open gucouture opened 5 years ago

gucouture commented 5 years ago

Hi!

Is there any way to reproduce kmeans results by setting the seed? I don’t see how.

I tried every way I could think of: set.seed, the Armadillo seed, math::RandomSeed, a custom seed through refined start, and so on. I finally found that without initialGuesses, the initial guesses are random. So I went into random_partition.hpp and changed

this line:

assignments = arma::shuffle(arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols));

to these lines (this example uses seed = 11):

assignments = arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols);
shuffle(assignments.begin(), assignments.end(), std::default_random_engine(11));

This is the only workaround I've found. It would be great if the module let users reproduce results simply by setting a seed value.
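For readers who want to try the idea outside MLPACK, here is a minimal stand-alone sketch of a seeded random partition. It is an illustration under stated assumptions, not the library's code: `SeededRandomPartition` is a hypothetical name, it uses `std::vector` instead of `arma::Col`, and it assigns labels round-robin rather than through `arma::linspace`. The permutation produced by `std::shuffle` is repeatable for a given seed on a given standard library, but may differ between standard library implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper, not part of MLPACK: assign n_points points to
// `clusters` groups of near-equal size, then shuffle with a fixed seed.
std::vector<std::size_t> SeededRandomPartition(std::size_t n_points,
                                               std::size_t clusters,
                                               unsigned int seed)
{
  assert(clusters > 0);

  std::vector<std::size_t> assignments(n_points);
  // Round-robin labels 0 .. clusters-1 keep the cluster sizes balanced.
  for (std::size_t i = 0; i < n_points; ++i)
    assignments[i] = i % clusters;

  // An explicitly seeded engine makes the permutation repeatable.
  std::mt19937 engine(seed);
  std::shuffle(assignments.begin(), assignments.end(), engine);
  return assignments;
}
```

Calling `SeededRandomPartition(10, 3, 11)` twice should return the same vector both times, which is the property the workaround is after.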

Thank you!

thirdwing commented 5 years ago

Theoretically, arma_rng::set_seed should work.

I will look into this later.

coatless commented 5 years ago

@thirdwing the seed needs to be set at the R level via set.seed(111).

eddelbuettel commented 5 years ago

Not if it uses Arma's RNG and ... I think I just saw a commit for this recently?

coatless commented 5 years ago

@dirk I think you might be referring to the SO question involving gamma. By default, RcppArmadillo overrides the set_seed() behavior.

https://github.com/RcppCore/RcppArmadillo/blob/66fe84bc6993b6225e5d49cb2eac81e5e923cf3a/inst/include/RcppArmadillo/Alt_R_RNG.h#L61-L72


inline void arma_rng_alt::set_seed(const arma_rng_alt::seed_type val) {
    // null-op, cannot set seed in R from C level code
    // see http://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Random-numbers
    //
    // std::srand(val);
    (void) val;                 // to suppress a -Wunused warning
    //
    static int havewarned = 0;
    if (havewarned++ == 0) {
        ::Rf_warning("When called from R, the RNG seed has to be set at the R level via set.seed()");
    }
}

Since RcppMLPACK1 links to RcppArmadillo, this function is the one that gets picked up.

TL;DR Only R controls the seed, except for newly added distributions, which need their own issue in RcppArmadillo.
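The consequence of a null-op `set_seed()` can be shown with a small stand-alone sketch. Everything here is hypothetical and only mimics the pattern quoted above: `HostRng` stands in for R's generator, and `DeferringBackend` stands in for a library RNG that ignores its own seed and draws from the host.

```cpp
#include <cassert>
#include <cstdio>
#include <random>

// Hypothetical stand-in for the host's generator (R's RNG in the real case).
struct HostRng
{
  std::mt19937 engine{42};
  void seed(unsigned int s) { engine.seed(s); }
  std::mt19937::result_type next() { return engine(); }
};

// Hypothetical library-side RNG that defers to the host generator.
struct DeferringBackend
{
  explicit DeferringBackend(HostRng& h) : host(h) { }

  // Null-op: the value is ignored and a warning is printed only once.
  void set_seed(unsigned int val)
  {
    (void) val;
    if (warnings == 0)
    {
      ++warnings;
      std::fprintf(stderr, "seed must be set on the host side\n");
    }
  }

  // Every draw comes from the host, so only the host's seed matters.
  std::mt19937::result_type draw() { return host.next(); }

  HostRng& host;
  int warnings = 0;
};
```

With this setup, calling `backend.set_seed(1)` or `backend.set_seed(2)` makes no difference to the values returned by `draw()`; only `host.seed(...)` does.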

eddelbuettel commented 5 years ago

I am not.

thirdwing commented 5 years ago

The MLPACK in this repo is too old.

Actually, the Armadillo RNG seed is not set in its RandomSeed: https://github.com/rcppmlpack/RcppMLPACK1/blob/master/src/mlpack/core/math/random.hpp#L63-L67

But this has been fixed recently.

eddelbuettel commented 5 years ago

That's what I thought -- I seem to recall a one line patch from you but I cannot find it any more. Where did you fix it?

gucouture commented 5 years ago

I saw the commit somewhere, but also setting the Armadillo seed through the RandomSeed function does not work, since the arma::shuffle function doesn't use it.

thirdwing commented 5 years ago

I fixed it in another branch and found it didn't help, so I deleted that branch.

eddelbuettel commented 5 years ago

Got it. At least I am not hallucinating.

Still too bad that @s-u ignores us and the rest of the world with the recipes repo. If that wasn't dead and ignored we could place MLPACK there. Oh well.

Would still be better for us to focus on RcppMLPACK2 here but the deployment...

gucouture commented 5 years ago

For the sake of it, here is my complete workaround

kmeans.hpp

Line 149 : Added “const int seed = 0” in order to set the seed for initial shuffling

kmeans_impl.hpp

Line 108 : Added “const int seed” in order to set the seed for initial shuffling

Line 171 : Added the seed as the last parameter in order to set the seed for initial shuffling

random_partition.hpp

Line 57 : Added “const int seed” in order to set the seed for initial shuffling

Lines 59 to 63 : Replacement of the shuffle function in order to insert the seed value
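The three changes listed above amount to threading one extra parameter from the public entry point down to the partition policy. A rough stand-alone sketch of that shape follows. The names `SeededPartition`, `KMeansSketch` and `InitialAssignments` are hypothetical and the real MLPACK signatures take more arguments; the sketch only shows a defaulted `seed` being forwarded to a static `Cluster` method.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the patched partition policy.
struct SeededPartition
{
  static void Cluster(const std::size_t n_points,
                      const std::size_t clusters,
                      std::vector<std::size_t>& assignments,
                      const int seed)
  {
    assert(clusters > 0);
    assignments.resize(n_points);
    for (std::size_t i = 0; i < n_points; ++i)
      assignments[i] = i % clusters;
    // The seed passed down by the caller drives the shuffle.
    std::shuffle(assignments.begin(), assignments.end(),
                 std::default_random_engine(seed));
  }
};

// Hypothetical stand-in for the clustering class that owns the policy.
template<typename InitialPartitionPolicy = SeededPartition>
class KMeansSketch
{
 public:
  // `seed` defaults to 0 so call sites that omit it keep compiling.
  std::vector<std::size_t> InitialAssignments(const std::size_t n_points,
                                              const std::size_t clusters,
                                              const int seed = 0) const
  {
    std::vector<std::size_t> assignments;
    InitialPartitionPolicy::Cluster(n_points, clusters, assignments, seed);
    return assignments;
  }
};
```

Defaulting the parameter at the outermost level keeps the change backward compatible, at the cost that callers who omit it always get the seed-0 partition rather than a fresh random one.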

Here is the complete random_partition.hpp

/**
 * @file random_partition.hpp
 * @author Ryan Curtin
 *
 * Very simple partitioner which partitions the data randomly into the number of
 * desired clusters.  Used as the default InitialPartitionPolicy for KMeans.
 *
 * This file is part of MLPACK 1.0.10.
 *
 * MLPACK is free software: you can redistribute it and/or modify it under the
 * terms of the GNU Lesser General Public License as published by the Free
 * Software Foundation, either version 3 of the License, or (at your option) any
 * later version.
 *
 * MLPACK is distributed in the hope that it will be useful, but WITHOUT ANY
 * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
 * A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more
 * details (LICENSE.txt).
 *
 * You should have received a copy of the GNU General Public License along with
 * MLPACK.  If not, see <http://www.gnu.org/licenses/>.
 */
#ifndef __MLPACK_METHODS_KMEANS_RANDOM_PARTITION_HPP
#define __MLPACK_METHODS_KMEANS_RANDOM_PARTITION_HPP

#include <algorithm>  // std::shuffle
#include <random>     // std::default_random_engine

#include <mlpack/core.hpp>

namespace mlpack {
namespace kmeans {

/**
 * A very simple partitioner which partitions the data randomly into the number
 * of desired clusters.  It has no parameters, and so an instance of the class
 * is not even necessary.
 */
class RandomPartition
{
 public:
  //! Empty constructor, required by the InitialPartitionPolicy policy.
  RandomPartition() { }

  /**
   * Partition the given dataset into the given number of clusters.  Assignments
   * are random, and the number of points in each cluster should be equal (or
   * approximately equal).
   *
   * @tparam MatType Type of data (arma::mat or arma::sp_mat).
   * @param data Dataset to partition.
   * @param clusters Number of clusters to split dataset into.
   * @param assignments Vector to store cluster assignments into.  Values will
   *     be between 0 and (clusters - 1).
   * @param seed Seed for the random engine that shuffles the assignments.
   */
  template<typename MatType>
  inline static void Cluster(const MatType& data,
                             const size_t clusters,
                             arma::Col<size_t>& assignments,
                             const int seed)
  {
    // Implementation is so simple we'll put it here in the header file.
    // assignments = arma::shuffle(arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols));

    assignments = arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols);
    // Shuffle with an explicitly seeded engine so the partition is reproducible.
    std::shuffle(assignments.begin(), assignments.end(), std::default_random_engine(seed));
  }
};

};
};

#endif

Cheers!

thirdwing commented 5 years ago

Let's approach it this way:

  1. Why is arma::shuffle not controlled by arma_rng?

  2. Is it fixed in MLPACK 3?

rcurtin commented 5 years ago

> is it fixed in MLPACK 3?

Well, it's not changed, but you could call mlpack::math::RandomSeed(seed) to set the seed both for Armadillo and all the other RNG support mlpack uses.
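The idea of one call that seeds every generator in use can be sketched in isolation as below. This is not mlpack's implementation: `sketch::Engine`, `sketch::SeedAll` and `sketch::RandInt` are hypothetical names, and the Armadillo call is left as a comment so that the block compiles without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <random>

namespace sketch {

// Hypothetical process-wide engine, standing in for a library's own generator.
inline std::mt19937& Engine()
{
  static std::mt19937 engine;
  return engine;
}

// Seed every generator the code base draws from in a single call.
inline void SeedAll(const std::size_t seed)
{
  Engine().seed(static_cast<std::mt19937::result_type>(seed));
  std::srand(static_cast<unsigned int>(seed));
  // Code built on Armadillo would presumably also call
  // arma::arma_rng::set_seed(seed) at this point.
}

// Draw an integer in [lo, hi) from the shared engine.
inline int RandInt(const int lo, const int hi)
{
  assert(lo < hi);
  std::uniform_int_distribution<int> dist(lo, hi - 1);
  return dist(Engine());
}

} // namespace sketch
```

After `sketch::SeedAll(11)`, both `sketch::RandInt(...)` and `std::rand()` restart from the same point, so a second run with the same seed repeats the first. Anything that draws from a generator not seeded here, such as a shuffle using its own engine, stays outside that guarantee.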

For what it's worth, and I know I've said a lot and produced only a little on this, but the mlpack command-line programs and Python bindings allow the random seed to be specified. If the R bindings were finished and polished, then you could call mlpack_kmeans() (or whatever we might call it) from R and specify the random seed as an option. However, my time keeps disappearing to more urgent things. @eddelbuettel ---I saw that R was accepted into GSoC this year; any interest in some kind of joint project or finding someone to write these bindings? I can help co-mentor, or even outside of GSoC I can provide some detailed guidance, but the time just isn't coming together for me to implement these right.

coatless commented 5 years ago

@rcurtin part of this is taken care of here: https://github.com/rcppmlpack/rcppmlpack2

The issue is getting it listed on CRAN.

More than happy to act as a mentor. I'm not sure how much time @eddelbuettel has.

rcurtin commented 5 years ago

Right, I'm not too familiar with the CRAN issues or anything like that. I've talked with Dirk in email about this, but basically, we have an 'automatic bindings system' that lets us basically define, for each language, what the matrix types and integer types and string types are, and how to map between the C++ and target language types. Once all that is set up for a new language (e.g. R), then all of our "bindings" (basically 40+ machine learning methods written as C++ programs with macros defining the behavior, a la https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/dbscan/dbscan_main.cpp ) can automatically be compiled to provide bindings to the new language.

So, for a project, someone could implement the backend needed for R, and then suddenly we'd have R bindings that magically kept themselves up to date and were a part of core mlpack (meaning that you guys don't have to duplicate work "chasing" the new features we add, etc.). However, I don't know how exactly it should all look in R, and I'm not privy to the details of how to get it on CRAN.
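To make the type-mapping idea concrete, here is a deliberately tiny sketch of how a per-language mapping could be expressed with template specializations. It is not mlpack's bindings code and does not use its macros; `RTypeName` and `DescribeParam` are hypothetical names, and the R type names are only one plausible choice.

```cpp
#include <cassert>
#include <string>

// Hypothetical trait: how a C++ parameter type is named in the target
// language (R in this sketch). One specialization per supported type.
template<typename T>
struct RTypeName;

template<>
struct RTypeName<int>
{
  static std::string Get() { return "integer"; }
};

template<>
struct RTypeName<double>
{
  static std::string Get() { return "numeric"; }
};

template<>
struct RTypeName<bool>
{
  static std::string Get() { return "logical"; }
};

template<>
struct RTypeName<std::string>
{
  static std::string Get() { return "character"; }
};

// Produce one line of generated documentation for a parameter.
template<typename T>
std::string DescribeParam(const std::string& name)
{
  return name + ": " + RTypeName<T>::Get();
}
```

For example, `DescribeParam<int>("seed")` yields `"seed: integer"`. Supporting another language would then mean supplying a different set of specializations, while each method's parameter declarations stay unchanged.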

If we can find someone (GSoC student or not) to put it together, I think it would be great. It would be a good project for that kind of timeframe (a few months). :+1:

rcurtin commented 5 years ago

Wait, I already wrote about this all and forgot in https://github.com/rcppmlpack/rcppmlpack2/issues/18 . Too many things going on... :confused:

eddelbuettel commented 5 years ago

Thumbs up for GSoC and/or other work on this. We should make these bindings happen. But then again, we have said so several times before, and time is indeed elusive.

thirdwing commented 5 years ago

It seems everyone agrees on GSoC, so let's put a project here: https://github.com/rstats-gsoc/gsoc2019/wiki

rcurtin commented 5 years ago

See also https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas#automatic-bindings-to-new-languages for some more detail. It might be better if the student comes from the R community, since there's definitely some amount of R knowledge that's going to be needed to make the bindings "look right" and follow the typical conventions of R machine learning algorithms.

eddelbuettel commented 5 years ago

I had my arm twisted by someone over from the R/Finance world so I am offering to be auxiliary mentor on one project -- but I'd be happy to pitch in here too if some of you (@rcurtin @thirdwing @coatless ...) can line up too.

rcurtin commented 5 years ago

If we can find a student with good C++ skills, I can walk them through what's needed from the C++ side and mentor that part of it. But I am also pretty busy, so I wouldn't be able to mentor alone. I suspect with a few of us we can do the job between us if we find a good student.

coatless commented 5 years ago

I'll throw together a project description for gsoc 2019. I may have a student in mind from CS @ UIUC.

Mentorwise, I think:

coatless commented 5 years ago

@thirdwing I'm assuming silence means you don't have time to be a GSOC mentor for the project?

@rcurtin + @eddelbuettel, I've added the following text to the R organization's GSOC wiki:

https://github.com/rstats-gsoc/gsoc2019/wiki/RcppMLPACK

thirdwing commented 5 years ago

@coatless Sorry for this. A little busy recently. If you need a backup mentor, you can add me.

I will have enough time after April.