mlampros / ClusterR

Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering
https://mlampros.github.io/ClusterR/
84 stars 29 forks source link

Skip set.seed() if NA_integer_ is given. #53

Closed heavywatal closed 3 months ago

heavywatal commented 3 months ago

This should address #52 but currently it does not work because NA_integer_ is implemented as the maximum negative value, and a negative seed causes an error "negative length vectors are not allowed". base::set.seed() accepts negative seeds without any problem, which indicates ClusterR is doing something wrong with negative seeds.

heavywatal commented 3 months ago

Sorry, my bad. I was doing wrong on initialization of Rcpp::IntegerVector. Now it accepts NA_integer_ as expected.

heavywatal commented 3 months ago

With this modification, it becomes possible to do this:

set.seed(1L)
KMeans_arma(iris[,-5], 3L, seed = NA_integer_)

which generates exactly the same result as this:

KMeans_arma(iris[,-5], 3L, seed = 1L)

It is also possible to do the same thing in Rcpp level.

mlampros commented 3 months ago

@heavywatal

Thanks for the pull request and the additional functionality. A few requests from my side:

heavywatal commented 3 months ago

OK, for now, I have added the minimal set of tests to validate the modification.

mlampros commented 3 months ago

Do you really think it is necessary to repeat this for every function with seed= parameter?

We have to make sure the previous version of the ClusterR and the updated version using your adjustments in the code do not give different results. Therefore, we have to test the functions which have as an argument the "seed"

Sorry I have never heard of ASAN or UBSAN

If you use compiled code in an R package then the code is tested for "Undefined Behaviour". Your adjustment which is the following line

if (Rcpp::all(Rcpp::is_na(Rcpp::IntegerVector(seed)))) return;

Will take an NA (missing value) as input and this might lead to undefined behaviour (or might not - I don't know that, because I don't know the internals of the Rcpp package).

I'll tell you from personal experience that I spent many hours in the past to fix errors from CRAN related to ASAN, UBSAN and avoid the exclusion of the package from CRAN.

heavywatal commented 3 months ago

We have to make sure the previous version of the ClusterR and the updated version using your adjustments in the code do not give different results.

I appreciate the purpose, but partially disagree on the conclusion:

Therefore, we have to test the functions which have as an argument the "seed"

because it should have been logically proven already by the combination of the existing tests and my additional tests. Is there any ClusterR function that uses seed in a different way from KMeans_arma()? Now that KMeans_arma() passes the tests, is it possible for KMeans_rcpp(), for example, to fail in terms of set_seed() modification? It seems to me that repetitive tests will just increase the code complexity uneconomically without much benefit.

But, OK, you are the author of this package, and I will add some tests for the other functions in the near future.

I'll tell you from personal experience that I spent many hours in the past to fix errors from CRAN related to ASAN, UBSAN and avoid the exclusion of the package from CRAN.

Thank you for the effort. You saved my day(s). Is it possible that we skip our local ASAN/UBSAN tests and give it a shot to let CRAN do it (hopefully automated)?

mlampros commented 3 months ago

Thank you for the effort. You saved my day(s)

By "spent time" I meant to debug previous error cases related to ASAN, UBSAN and not related to your current additions in the .cpp files. This is something you have to test/check using the existing packages such as rhub or any packages that you think it can serve the purpose.

Is it possible that we skip our local ASAN/UBSAN tests and give it a shot to let CRAN do it (hopefully automated)?

I'm sorry I can not do that. If you are not currently in place to perform the test/check then feel free to close the PR and re-open or create a new one once you are ready. thanks

mlampros commented 3 months ago

I'll close the PR for now. Feel free to re-open or open a new one.