yitao-li opened 3 years ago
Possibly a slightly separate topic, but for distributed computing with Apache Spark there should also be a way to ensure that PRNGs on multiple Spark workers do not produce correlated streams of pseudo-random numbers when such correlation is undesired. Currently it is easy to create such accidental correlations, simply by serializing one PRNG state and distributing that same state to multiple Spark workers. Ideally there would be some way of preventing this from happening, while still ensuring reproducibility, which can become tricky with Apache Spark.
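To make the point concrete, here is a minimal sketch (in Python, not sparklyr's actual API) of one common workaround: instead of shipping a single serialized PRNG state to every worker, derive a distinct per-partition seed from one base seed. The `partition_seed` helper and the hash-based derivation are assumptions for illustration, not anything Spark provides out of the box:

```python
import hashlib
import random

def partition_seed(base_seed: int, partition_id: int) -> int:
    # Hypothetical helper: derive a distinct, reproducible 64-bit seed
    # per partition from a single base seed, rather than broadcasting
    # one PRNG state (which would make every worker emit the same stream).
    digest = hashlib.sha256(f"{base_seed}:{partition_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Each partition gets its own stream; re-running with the same base seed
# reproduces the same streams, while streams differ across partitions.
streams = [random.Random(partition_seed(42, p)) for p in range(4)]
first_draws = [rng.random() for rng in streams]
assert len(set(first_draws)) == 4  # no two partitions share a stream
```

Note this only avoids *identical* streams; hash-derived seeds do not carry the statistical-independence guarantees of purpose-built stream families (e.g. counter-based generators), so it is a sketch of the idea rather than a recommendation.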
In an ideal world there would also be a way to guarantee the "continuation" of the PRNG state from one `sdf_*` call to the next, unless a `set.seed` call in between deliberately prevents such continuation (i.e., by resetting the PRNG state, and possibly replacing the previous PRNG seed with something else so that a different stream of pseudo-random numbers is generated). Currently it is not immediately clear from the Apache Spark documentation how this can be accomplished, nor whether it is feasible at all when data is distributed across multiple threads or multiple machines within a Spark cluster.