Open gucouture opened 5 years ago
Theoretically, arma_rng::set_seed should work.
I will look into this later.
@thirdwing The seed needs to be set at the R level via set.seed(111).
Not if it uses Arma's RNG and ... I think I just saw a commit for this recently?
@dirk I think you might be referring to the SO question involving gamma. By default, RcppArmadillo overrides the set_seed() behavior.
inline void arma_rng_alt::set_seed(const arma_rng_alt::seed_type val) {
    // null-op, cannot set seed in R from C level code
    // see http://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Random-numbers
    //
    // std::srand(val);
    (void) val;                 // to suppress a -Wunused warning
    //
    static int havewarned = 0;
    if (havewarned++ == 0) {
        ::Rf_warning("When called from R, the RNG seed has to be set at the R level via set.seed()");
    }
}
Since RcppMLPACK1 links out to RcppArmadillo, this function is picked.
TL;DR: Only R controls the seed, except for newly added distributions, which need their own issue in RcppArmadillo.
I am not.
The MLPACK in this repo is too old.
Actually, the arma RNG seed is not set in its RandomSeed function:
https://github.com/rcppmlpack/RcppMLPACK1/blob/master/src/mlpack/core/math/random.hpp#L63-L67
But this has been fixed recently.
That's what I thought -- I seem to recall a one line patch from you but I cannot find it any more. Where did you fix it?
I saw the commit somewhere, but also setting the arma seed through the RandomSeed function does not work, since the arma::shuffle function doesn't use it.
I fixed it in another branch and found it didn't help, so I deleted that branch.
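The situation described above, where a seed is set on one generator while the shuffle draws from another, can be illustrated with a small standalone sketch. Everything in it is hypothetical: seedable_engine, shuffle_engine, set_seed and library_shuffle are stand-ins invented for illustration and say nothing about how Armadillo or mlpack are actually implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-ins for two unrelated RNG states inside one library.
static std::mt19937 seedable_engine;  // the engine a "set seed" call reaches
static std::mt19937 shuffle_engine;   // the engine the shuffle actually draws from

// Hypothetical "set seed" entry point: it only touches seedable_engine.
void set_seed(unsigned int seed)
{
  seedable_engine.seed(seed);
}

// Shuffle the values 0..n-1 using shuffle_engine; seedable_engine is never consulted.
std::vector<std::size_t> library_shuffle(std::size_t n)
{
  assert(n > 0);
  std::vector<std::size_t> v(n);
  std::iota(v.begin(), v.end(), std::size_t{0});
  std::shuffle(v.begin(), v.end(), shuffle_engine);
  return v;
}
```

Under these assumptions, calling set_seed(111) before each call to library_shuffle does not make the two results match, because the shuffle keeps advancing its own engine. Only reseeding shuffle_engine itself makes the permutation repeat.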
Got it. At least I am not hallucinating.
Still too bad that @s-u ignores us and the rest of the world with the recipes repo. If that wasn't dead and ignored we could place MLPACK there. Oh well.
Would still be better for us to focus on RcppMLPACK2 here but the deployment...
For the sake of it, here is my complete workaround
kmeans.hpp
Line 149 : Added “const int seed = 0” in order to set the seed for initial shuffling
kmeans_impl.hpp
Line 108 : Added “const int seed” in order to set the seed for initial shuffling
Line 171 : Added the seed as the last parameter in order to set the seed for initial shuffling
random_partition.hpp
Line 57 : Added “const int seed” in order to set the seed for initial shuffling
Lines 59 to 63 : Replacement of the shuffle function in order to insert the seed value
Here is the complete random_partition.hpp
/**
* @file random_partition.hpp
* @author Ryan Curtin
*
* Very simple partitioner which partitions the data randomly into the number of
* desired clusters. Used as the default InitialPartitionPolicy for KMeans.
*
* This file is part of MLPACK 1.0.10.
*
* MLPACK is free software: you can redistribute it and/or modify it under the
* terms of the GNU Lesser General Public License as published by the Free
* Software Foundation, either version 3 of the License, or (at your option) any
* later version.
*
* MLPACK is distributed in the hope that it will be useful, but WITHOUT ANY
* WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
* A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more
* details (LICENSE.txt).
*
* You should have received a copy of the GNU General Public License along with
* MLPACK. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef __MLPACK_METHODS_KMEANS_RANDOM_PARTITION_HPP
#define __MLPACK_METHODS_KMEANS_RANDOM_PARTITION_HPP
#include <algorithm> // std::shuffle
#include <random>    // std::default_random_engine
#include <mlpack/core.hpp>
namespace mlpack {
namespace kmeans {
/**
* A very simple partitioner which partitions the data randomly into the number
* of desired clusters. It has no parameters, and so an instance of the class
* is not even necessary.
*/
class RandomPartition
{
 public:
  //! Empty constructor, required by the InitialPartitionPolicy policy.
  RandomPartition() { }

  /**
   * Partition the given dataset into the given number of clusters. Assignments
   * are random, and the number of points in each cluster should be equal (or
   * approximately equal).
   *
   * @tparam MatType Type of data (arma::mat or arma::sp_mat).
   * @param data Dataset to partition.
   * @param clusters Number of clusters to split dataset into.
   * @param assignments Vector to store cluster assignments into. Values will
   *     be between 0 and (clusters - 1).
   * @param seed Seed for the engine that shuffles the initial assignments.
   */
  template<typename MatType>
  inline static void Cluster(const MatType& data,
                             const size_t clusters,
                             arma::Col<size_t>& assignments,
                             const int seed)
  {
    // Implementation is so simple we'll put it here in the header file.
    // Original line, replaced so that the shuffle can be seeded:
    // assignments = arma::shuffle(arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols));
    assignments = arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols);
    std::shuffle(assignments.begin(), assignments.end(), std::default_random_engine(seed));
  }
};

} // namespace kmeans
} // namespace mlpack
#endif
Cheers!
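The changes listed above (a seed parameter with a default value at the top level, forwarded down to the partitioner) can be sketched without any mlpack or Armadillo dependency. This is only an illustration under stated assumptions: SeededRandomPartition and InitialAssignments are hypothetical names, std::vector stands in for arma::Col<size_t>, the balanced labels are computed with integer arithmetic instead of arma::linspace, and std::mt19937 is swapped in for std::default_random_engine because the latter is implementation-defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical, simplified stand-in for RandomPartition.
struct SeededRandomPartition
{
  // Fill `assignments` with roughly balanced labels in [0, clusters - 1],
  // then shuffle them with an engine seeded from `seed`.
  static void Cluster(const std::size_t n_points,
                      const std::size_t clusters,
                      std::vector<std::size_t>& assignments,
                      const int seed)
  {
    assert(clusters > 0);
    assignments.resize(n_points);
    for (std::size_t i = 0; i < n_points; ++i)
      assignments[i] = i * clusters / n_points;  // always < clusters

    std::mt19937 engine(static_cast<std::mt19937::result_type>(seed));
    std::shuffle(assignments.begin(), assignments.end(), engine);
  }
};

// Hypothetical top-level entry point: the seed defaults to 0 and is simply
// forwarded to the partition policy, mirroring the "const int seed = 0" change.
template<typename PartitionPolicy = SeededRandomPartition>
std::vector<std::size_t> InitialAssignments(const std::size_t n_points,
                                            const std::size_t clusters,
                                            const int seed = 0)
{
  std::vector<std::size_t> assignments;
  PartitionPolicy::Cluster(n_points, clusters, assignments, seed);
  return assignments;
}
```

Calling InitialAssignments(10, 3, 11) twice gives the same vector within one build. Note that the exact permutation std::shuffle produces for a given engine state is not fixed by the C++ standard, so results are only expected to be reproducible with the same standard library implementation.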
Let's do it in this way:
Why is arma::shuffle not controlled by arma_rng?
Is it fixed in MLPACK 3?
Well, it's not changed, but you could call mlpack::math::RandomSeed(seed) to set the seed both for Armadillo and all the other RNG support mlpack uses.
For what it's worth, and I know I've said a lot and produced only a little on this, the mlpack command-line programs and Python bindings allow the random seed to be specified. If the R bindings were finished and polished, then you could call mlpack_kmeans() (or whatever we might call it) from R and specify the random seed as an option. However, my time keeps disappearing to more urgent things. @eddelbuettel, I saw that R was accepted into GSoC this year; any interest in some kind of joint project or finding someone to write these bindings? I can help co-mentor, or even outside of GSoC I can provide some detailed guidance, but the time just isn't coming together for me to implement these right.
@rcurtin part of this is taken care of here: https://github.com/rcppmlpack/rcppmlpack2
The issue is listing on CRAN.
More than happy to act as a mentor. I'm not sure how much time @eddelbuettel has.
Right, I'm not too familiar with the CRAN issues or anything like that. I've talked with Dirk in email about this, but basically, we have an 'automatic bindings system' that lets us basically define, for each language, what the matrix types and integer types and string types are, and how to map between the C++ and target language types. Once all that is set up for a new language (e.g. R), then all of our "bindings" (basically 40+ machine learning methods written as C++ programs with macros defining the behavior, a la https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/dbscan/dbscan_main.cpp ) can automatically be compiled to provide bindings to the new language.
So, for a project, someone could implement the backend needed for R, and then suddenly we'd have R bindings that magically kept themselves up to date and were a part of core mlpack (meaning that you guys don't have to duplicate work "chasing" the new features we add, etc.). However, I don't know how exactly it should all look in R, and I'm not privy to the details of how to get it on CRAN.
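To make the idea of a per-language type mapping a bit more concrete, here is a minimal sketch of how a C++ trait could associate parameter types with the names R uses for them. It is not mlpack's actual binding API: RTypeName and DescribeParam are hypothetical names invented for this illustration, and a real backend would also have to handle matrices, models, and code generation.

```cpp
#include <cassert>
#include <string>

// Hypothetical trait mapping a C++ parameter type to an R type name.
// There is no generic definition, so an unsupported type fails to compile.
template<typename T>
struct RTypeName;

template<>
struct RTypeName<int>
{
  static std::string Get() { return "integer"; }
};

template<>
struct RTypeName<double>
{
  static std::string Get() { return "numeric"; }
};

template<>
struct RTypeName<bool>
{
  static std::string Get() { return "logical"; }
};

template<>
struct RTypeName<std::string>
{
  static std::string Get() { return "character"; }
};

// Hypothetical helper: describe one parameter the way a generated R
// wrapper might document it, e.g. "clusters (integer)".
template<typename T>
std::string DescribeParam(const std::string& name)
{
  assert(!name.empty());
  return name + " (" + RTypeName<T>::Get() + ")";
}
```

With a table like this in place for each target language, the same parameter declarations could in principle be rendered for R, Python, or the command line without touching the method code itself.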
If we can find someone (GSoC student or not) to put it together, I think it would be great. It would be a good project for that kind of timeframe (a few months). :+1:
Wait, I already wrote about this all and forgot in https://github.com/rcppmlpack/rcppmlpack2/issues/18 . Too many things going on... :confused:
Thumbs up for GSoC and/or other work on this. We should make these bindings happen. But then again, we have said this several times before, and time is indeed elusive.
It seems everyone agrees on GSoC, so let's put a project here: https://github.com/rstats-gsoc/gsoc2019/wiki
See also https://github.com/mlpack/mlpack/wiki/SummerOfCodeIdeas#automatic-bindings-to-new-languages for some more detail. It might be better if the student comes from the R community, since there's definitely some amount of R knowledge that's going to be needed to make the bindings "look right" and follow the typical conventions of R machine learning algorithms.
I had my arm twisted by someone over from the R/Finance world so I am offering to be auxiliary mentor on one project -- but I'd be happy to pitch in here too if some of you (@rcurtin @thirdwing @coatless ...) can line up too.
If we can find a student with good C++ skills, I can walk them through what's needed from the C++ side and mentor that part of it. But I am also pretty busy, so I wouldn't be able to mentor alone. I suspect with a few of us we can do the job between us if we find a good student.
I'll throw together a project description for gsoc 2019. I may have a student in mind from CS @ UIUC.
Mentorwise, I think:
@thirdwing I'm assuming silence means you don't have time to be a GSOC mentor for the project?
@rcurtin + @eddelbuettel, I've added the following text to the R organization's GSOC wiki:
@coatless Sorry for this. A little busy recently. If you need a backup mentor, you can add me.
I will have enough time after April.
Hi!
Is there any way to reproduce kmeans results by setting the seed? I don’t see how.
I tried every way I could think of: set.seed, the arma seed, math::RandomSeed, a custom seed through refined start, and so on. I finally found that without initialGuesses, the initial guesses are random. So I went into random_partition.hpp and replaced this line:
assignments = arma::shuffle(arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols));
with these lines (this example uses seed = 11):
assignments = arma::linspace<arma::Col<size_t> >(0, (clusters - 1), data.n_cols);
shuffle(assignments.begin(), assignments.end(), std::default_random_engine(11));
This is the only workaround I've found. It would be nice to enhance the module so that results can be reproduced simply by setting a seed value.
Thank you!