Open fkiraly opened 9 months ago
In case you are getting deep into the weeds and need to consider independent random numbers in a distributed computing environment, you may need to think about the PRNG itself and not just the seed. FYI https://www.thesalmons.org/john/random123/papers/random123sc11.pdf
> deep into the weeds

Ouch, not that deep into the weeds. I think we need to deal with the case of a single location/env only, and leave pseudo-random seed handling for distributed environments to backends like `joblib`.
@ericjb, I have come to realize that we will indeed probably need a PRNG which guarantees pseudo-random independence across a tree-like hierarchy. As such, the linked paper is exactly the kind I was looking for.
Pinging @johnsalmon, @moraesmark, @pbelevich regarding the paper - it would be great if there were a tool, possibly with Python bindings, which, for a tree-like structure of sampler objects, can generate independent pseudo-random seeds such that all samplers end up (mutually) pseudo-random independent, if any node in the tree does `node.set_random_seed(children, node.random_seed)`, assuming its own seed has been set by its parents.
(the scope of this package is in principle all of ML pipelines)
What is unfortunate is that we do not know the size of the tree in advance, so a node does not know the number of its children, nor can it communicate with its parents. Otherwise, the solution would be "fairly trivial" by doing the following at the root: 1. compute the number of nodes, 2. run a PRNG sequence generator, 3. distribute the seeds across the tree, following any enumeration of its nodes.
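For the known-size case, the three steps at the root can be sketched as follows. This is only an illustration under assumptions: the tree is represented as nested dicts, `count_nodes` and `assign_seeds` are hypothetical helper names, and the stdlib `random.Random` stands in for whatever PRNG would actually be chosen.

```python
import random


def count_nodes(tree):
    """Count nodes in a tree given as nested dicts: {name: {child_name: {...}}}."""
    return sum(1 + count_nodes(children) for children in tree.values())


def assign_seeds(tree, root_seed):
    """Return {path: seed} for every node, enumerated depth-first."""
    n_nodes = count_nodes(tree)                    # step 1: size of the tree
    rng = random.Random(root_seed)                 # step 2: one seeded generator
    seeds = iter(rng.sample(range(2**32), n_nodes))  # distinct 32-bit seeds
    assigned = {}

    def walk(subtree, prefix):                     # step 3: distribute by enumeration
        for name, children in subtree.items():
            path = prefix + (name,)
            assigned[path] = next(seeds)
            walk(children, path)

    walk(tree, ())
    return assigned


# hypothetical pipeline layout, for illustration only
pipeline = {"pipeline": {"bagging": {"forecaster_a": {}, "forecaster_b": {}}, "scaler": {}}}
seeds = assign_seeds(pipeline, root_seed=42)
```

The same `root_seed` always yields the same assignment, and every node gets a different seed. The catch is the one stated above: this only works if the root can see the whole tree.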
Getting a bit deeper in the woods, one option would be to convolve each call to a dependent random seed with:
That would ensure uniqueness, pseudo-randomness, and pseudo-independence, as long as no line of code contains more than one call, and no two files that the dependent seed generator is called from are identical (e.g., something silly like near-empty `__init__` files).
For reference and potential use in that, here is random code from stackoverflow that produces the line a function was called from:

    from inspect import currentframe


    def get_linenumber():
        # Line number of the caller, read from the frame one level up.
        cf = currentframe()
        return cf.f_back.f_lineno
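A minimal sketch of such a call-site-dependent seed, assuming the ingredients are the parent seed, a hash of the calling file's contents, and the calling line number. The names `mix_seed` and `dependent_seed` are hypothetical, and SHA-256 is just one possible mixing function; none of this is existing skbase or sktime API.

```python
import hashlib
from inspect import currentframe


def _file_digest(filename):
    """Hash the file's contents; fall back to hashing its name if unreadable."""
    try:
        with open(filename, "rb") as f:
            payload = f.read()
    except OSError:
        payload = filename.encode()
    return hashlib.sha256(payload).digest()


def mix_seed(parent_seed, file_digest, lineno):
    """Combine parent seed, file hash, and line number into a 32-bit seed."""
    header = f"{parent_seed}|{lineno}|".encode()
    digest = hashlib.sha256(header + file_digest).digest()
    return int.from_bytes(digest[:4], "big")


def dependent_seed(parent_seed):
    """Derive a seed that depends on where this function is called from."""
    # currentframe() is CPython-specific and may return None elsewhere.
    caller = currentframe().f_back
    return mix_seed(
        parent_seed, _file_digest(caller.f_code.co_filename), caller.f_lineno
    )


# Calls from the same line share a seed, which is the caveat stated above.
same_line = [dependent_seed(7) for _ in range(3)]
```

Calls from different lines or files get different inputs to the hash, so they should get different seeds up to hash collisions. Calls in a loop or comprehension all come from one line and therefore collide, so a per-call counter would have to be added for that case.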
Opening an issue to discuss API design around a requirement where independent, yet random-state-fixed copies of an estimator need to be obtained.
An example would be the bootstrap clones discussed here: https://github.com/sktime/sktime/discussions/5823 - these should be statistically independent pseudo-random.
Currently, `clone` copies the `random_seed` 1:1, which results in:

- `random_seed=None` results in independent copies - but not pseudo-random fixed (each run gives different values)
- `random_seed` is set, which results in value-identical copies, not statistically independent pseudo-random copies - but pseudo-random fixed copies

Neither meets the requirement above, because that would need to be both pseudo-random fixed and statistically independent (not value-identical).
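The two cases can be illustrated with a toy stand-in. `ToySampler` is hypothetical and is not the skbase or sktime implementation; it only mimics a `clone` that copies `random_seed` 1:1.

```python
import random


class ToySampler:
    """Hypothetical estimator-like object with a random_seed parameter."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def clone(self):
        # Mimics the behaviour described above: the seed is copied 1:1.
        return ToySampler(random_seed=self.random_seed)

    def sample(self, n=3):
        rng = random.Random(self.random_seed)  # None means OS entropy
        return [rng.random() for _ in range(n)]


seeded = ToySampler(random_seed=42)
a, b = seeded.clone(), seeded.clone()
identical = a.sample() == b.sample()  # fixed across runs, but value-identical

unseeded = ToySampler(random_seed=None)
c, d = unseeded.clone(), unseeded.clone()
# c.sample() and d.sample() differ from each other, but also from run to run
```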
In light of the rework of `random_seed` functionality (see https://github.com/sktime/skbase/pull/268), it is worth a discussion what this should even look like from the API perspective.

A key problem arises if multiple clones are needed - the number needs to be known in advance, or at least the clones need to be sampled in a chain, to obtain dependent seeds which give rise to pseudo-random independent copies.
Further, we cannot change the default behaviour of `clone` and its current parameters, as it is an interface point of high importance.

Options I can think of:

- `clone(deep=True, random_seed="exact_copy", n_clones=None)`
- `clone_random(deep=True, n_clones=1)`
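To make the second option concrete, here is a rough sketch on a toy class. Everything in it is an assumption for discussion: `ToySampler` and `_child_seed` are hypothetical, the `deep` argument is omitted because the toy has no nested components, and hashing the parent seed with the clone index is a heuristic rather than a proof of statistical independence. NumPy's `SeedSequence.spawn` addresses a related need and may be a better basis.

```python
import hashlib
import random


def _child_seed(parent_seed, index):
    """Derive a 32-bit child seed from the parent seed and the clone index."""
    digest = hashlib.sha256(f"{parent_seed}|{index}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


class ToySampler:
    """Hypothetical estimator-like object with a random_seed parameter."""

    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def sample(self, n=3):
        rng = random.Random(self.random_seed)
        return [rng.random() for _ in range(n)]

    def clone_random(self, n_clones=1):
        """Return n_clones copies with distinct, reproducible seeds."""
        if self.random_seed is None:
            # Nothing to fix: unseeded clones already differ on every run.
            return [ToySampler(None) for _ in range(n_clones)]
        return [
            ToySampler(_child_seed(self.random_seed, i)) for i in range(n_clones)
        ]


clones = ToySampler(random_seed=42).clone_random(n_clones=3)
clone_seeds = [c.random_seed for c in clones]
```

With a fixed parent seed, `clone_seeds` is the same on every run while the clones no longer produce identical values, which is the combination asked for above. The number of clones still has to be passed up front, or the index has to be tracked across calls.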
FYI @ericjb, @jmwhyte, @tpvasconcelos - since we all discussed either `clone` or `random_seed` recently.