sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

have the equivalent of `set.seed` in Spark that applies to the entire Spark session #2820

Open yitao-li opened 3 years ago

yitao-li commented 3 years ago

In an ideal world there would be a way to guarantee the "continuation" of the PRNG state from one `sdf_*` call to the next, unless a `set.seed` call in between deliberately breaks that continuation (i.e., by resetting the PRNG state, and possibly replacing the previous seed so that a different stream of pseudo-random numbers is produced). Currently it is not immediately clear from the Apache Spark documentation how this can be accomplished, or whether it is feasible at all when data is distributed across multiple threads / multiple machines within a Spark cluster.
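For concreteness, the desired semantics can be sketched with Python's single-machine PRNG as a stand-in (this is not Spark's or sparklyr's API, just an illustration of "continuation" vs. "reset"):

```python
import random

# Stand-in for the desired session-wide semantics (not Spark's API):
# after one seeding, successive calls continue the same stream, ...
random.seed(123)
first = [random.random() for _ in range(3)]
second = [random.random() for _ in range(3)]  # continues where `first` left off

# ... while re-seeding resets the state, replaying the stream from the start.
random.seed(123)
replay = [random.random() for _ in range(6)]
assert replay == first + second
```

The open question is whether Spark can offer the same two guarantees (continuation between `sdf_*` calls, deterministic replay after re-seeding) when the state lives on many executors at once.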

yitao-li commented 3 years ago

Possibly a slightly separate topic, but for distributed computing with Apache Spark there should ideally also be a way to ensure that PRNGs on different Spark workers do not produce correlated streams of pseudo-random numbers when such correlation is undesired. Currently it is easy to create such accidental correlation, for example by serializing one PRNG state and distributing the same state to multiple workers. Ideally there would be some way of preventing that while still ensuring reproducibility, which can become tricky with Apache Spark.
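One standard remedy (sketched here in plain Python, not sparklyr's API) is to derive a distinct per-worker stream from a single root seed, instead of shipping one shared PRNG state to every worker. The `worker_rng` / `worker_draws` helpers below are hypothetical names for illustration:

```python
import hashlib
import random

def worker_rng(root_seed: int, worker_id: int) -> random.Random:
    # Hypothetical sketch: derive an independent PRNG per worker by hashing
    # (root_seed, worker_id), rather than distributing one shared state.
    digest = hashlib.sha256(f"{root_seed}:{worker_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def worker_draws(root_seed: int, worker_id: int, n: int) -> list:
    # What one simulated worker/partition would produce.
    rng = worker_rng(root_seed, worker_id)
    return [rng.random() for _ in range(n)]

streams = [worker_draws(2820, w, 3) for w in range(4)]
assert streams[0] != streams[1]                                  # workers differ
assert streams == [worker_draws(2820, w, 3) for w in range(4)]   # reproducible
```

Libraries such as NumPy expose this pattern directly via `SeedSequence.spawn`; whether equivalent guarantees can be provided across JVM-side Spark workers, while keeping results reproducible, is the open question here.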