grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0
957 stars 248 forks source link

Non-integer seed? #1444

Open swager opened 1 day ago

swager commented 1 day ago

By default we give non-integer seeds to forests, e.g.:

https://github.com/grf-labs/grf/blob/5a2fbcd51bdeac67cbb85d2407a282d1962550e9/r-package/grf/R/regression_forest.R#L109C31-L109C71

It looks like this doesn't actually matter (i.e., the seed is in fact rounded down and it's OK), but still seems like explicitly using an integer seed ourselves could help avoid a future bug.

> set.seed(1)
> n = 100; p = 5
> X = matrix(rnorm(n * p), n, p)
> Y = rnorm(n)
> rf = regression_forest(X, Y, seed = 1)
> mean((rf$predictions - Y)^2)
[1] 0.9667780032
> rf = regression_forest(X, Y, seed = 1.234)
> mean((rf$predictions - Y)^2)
[1] 0.9667780032
> rf = regression_forest(X, Y, seed = 1.99)
> mean((rf$predictions - Y)^2)
[1] 0.9667780032
> rf = regression_forest(X, Y, seed = 2)
> mean((rf$predictions - Y)^2)
[1] 0.9692135702
> rf = regression_forest(X, Y, seed = exp(1))
> mean((rf$predictions - Y)^2)
[1] 0.9692135702
erikcs commented 1 day ago

Hmm, yes, I think all these pseudorandom generators operate with modular arithmetic so the fractional part of a number will never make a difference.

I see how doing runif by default is maybe a tiny bit unidiomatic from that perspective, but we can't change the default now as that will be a breaking change? I.e, set.seed(42); causal_forest(X,Y,Z), will currently produce the same result in all grf versions (ignoring the num.threads bit).

Are you worrying about the chance that runif will cause a rounding collision where you expect to get different result but don't because:

set.seed(something)
runif(1, 0, .Machine$integer.max)) happens to be 42.2
causal_forest(....)

set.seed(something else)
runif(1, 0, .Machine$integer.max)) happens to be 42.4
causal_forest(...) <- same output as above

Since the seed is runif between 0 and .Machine$integer.max that seems very unlikely, or is there another potential bug/unwanted behavior you had in mind?

erikcs commented 15 hours ago

I completely forgot to paste this, Base R also interprets all seeds as integers:

> set.seed(1)
> runif(1)
[1] 0.2655087
> set.seed(1.4)
> runif(1)
[1] 0.2655087