sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

have the equivalent of `set.seed` in Spark that applies to the entire Spark session #2820

Open yitao-li opened 3 years ago

yitao-li commented 3 years ago

In an ideal world there would be a way to guarantee the "continuation" of the PRNG state from one `sdf_*` call to the next, unless a `set.seed` call in between deliberately breaks that continuation (i.e., by resetting the PRNG state, and possibly replacing the previous seed so that a different stream of pseudo-random numbers is produced). Currently it is not immediately clear from the Apache Spark documentation how this can be accomplished, or whether it is feasible at all when data is distributed across multiple threads / multiple machines within a Spark cluster.
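For concreteness, the desired semantics can be sketched with Python's single-machine PRNG as a stand-in (this is not Spark's or sparklyr's API, just an illustration of "continuation" vs. "reset"):

```python
import random

# Stand-in for the desired session-wide semantics (not Spark's API):
# after one seeding, successive calls continue the same stream, ...
random.seed(123)
first = [random.random() for _ in range(3)]
second = [random.random() for _ in range(3)]  # continues where `first` left off

# ... while re-seeding resets the state, replaying the stream from the start.
random.seed(123)
replay = [random.random() for _ in range(6)]
assert replay == first + second
```

The open question is whether Spark can offer the same two guarantees (continuation between `sdf_*` calls, deterministic replay after re-seeding) when the state lives on many executors at once.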

yitao-li commented 3 years ago

Possibly a slightly separate topic, but for distributed computing with Apache Spark there should ideally also be a way to ensure that PRNGs on different Spark workers do not produce correlated streams of pseudo-random numbers when such correlation is undesired. Currently it is easy to create such accidental correlation, for example by serializing one PRNG state and distributing the same state to multiple workers. Ideally there would be some way of preventing that while still ensuring reproducibility, which can become tricky with Apache Spark.
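One standard remedy (sketched here in plain Python, not sparklyr's API) is to derive a distinct per-worker stream from a single root seed, instead of shipping one shared PRNG state to every worker. The `worker_rng` / `worker_draws` helpers below are hypothetical names for illustration:

```python
import hashlib
import random

def worker_rng(root_seed: int, worker_id: int) -> random.Random:
    # Hypothetical sketch: derive an independent PRNG per worker by hashing
    # (root_seed, worker_id), rather than distributing one shared state.
    digest = hashlib.sha256(f"{root_seed}:{worker_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def worker_draws(root_seed: int, worker_id: int, n: int) -> list:
    # What one simulated worker/partition would produce.
    rng = worker_rng(root_seed, worker_id)
    return [rng.random() for _ in range(n)]

streams = [worker_draws(2820, w, 3) for w in range(4)]
assert streams[0] != streams[1]                                  # workers differ
assert streams == [worker_draws(2820, w, 3) for w in range(4)]   # reproducible
```

Libraries such as NumPy expose this pattern directly via `SeedSequence.spawn`; whether equivalent guarantees can be provided across JVM-side Spark workers, while keeping results reproducible, is the open question here.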