sktime / skbase

Base classes for creating scikit-learn-like parametric objects, and tools for working with them.
BSD 3-Clause "New" or "Revised" License

[ENH] handling of `random_state` in `clone` #279

Open fkiraly opened 9 months ago

fkiraly commented 9 months ago

Opening an issue to discuss API design around a requirement where independent, yet random-state-fixed copies of an estimator need to be obtained.

An example would be the bootstrap clones discussed here: https://github.com/sktime/sktime/discussions/5823 - these should be statistically independent pseudo-random.

Currently, clone copies the random_seed 1:1, which results in:

Neither meets the requirement above, because the copies would need to be both pseudo-random fixed and statistically independent (not value-identical).
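To illustrate the "value-identical" failure mode with plain stdlib `random` (not skbase's `clone`): copying a seed 1:1 yields identical streams, which are fixed but not independent.

```python
import random

# Two "clones" that copy the random seed 1:1 produce value-identical
# streams -- pseudo-random fixed, but not statistically independent.
a = random.Random(42)
b = random.Random(42)

identical = [a.random() for _ in range(5)] == [b.random() for _ in range(5)]
```

Here `identical` is `True`, i.e., the two streams coincide draw by draw.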

In light of the rework of the `random_seed` functionality (see https://github.com/sktime/skbase/pull/268), it is worth discussing what this should look like from the API perspective.

A key problem arises if multiple clones are needed: the number must be known in advance, or at least the seeds need to be sampled in a chain, to obtain dependent seeds which give rise to pseudo-random independent copies.
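A minimal stdlib sketch of the "sampled in a chain" option (`spawn_seeds` is a hypothetical helper, not skbase API): one master RNG emits a chain of dependent seeds, each of which seeds one clone.

```python
import random

def spawn_seeds(master_seed, n):
    """Sample n dependent seeds in a chain from one master seed.

    The seeds are deterministic given master_seed, but the streams
    they seed behave as pseudo-independent."""
    master = random.Random(master_seed)
    return [master.getrandbits(64) for _ in range(n)]

seeds = spawn_seeds(12345, 3)
rngs = [random.Random(s) for s in seeds]  # one RNG per clone
```

Note this only works if the clones are produced sequentially from one master; it does not help when clones are created at unrelated call sites.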

Further, we cannot change the default behaviour of clone and its current parameters, as it is an interface point of high importance.

Options I can think of:

FYI @ericjb, @jmwhyte, @tpvasconcelos - since we all discussed either clone or random_seed recently.

ericjb commented 9 months ago

In case you are getting deep into the weeds and need to consider independent random numbers in a distributed computing environment, you may need to think about the PRNG itself and not just the seed. FYI https://www.thesalmons.org/john/random123/papers/random123sc11.pdf

fkiraly commented 9 months ago

deep into the weeds

Ouch, not that deep into the weeds. I think we need to deal with the case of single location/env only, and leave pseudo-random seed handling for distributed environments to backends like joblib.

fkiraly commented 8 months ago

@ericjb, I have come to realize that we will indeed probably need a PRNG which guarantees pseudo-random independence with a tree-like hierarchy. As such, the linked paper is exactly the kind of thing I was looking for.

fkiraly commented 8 months ago

Pinging @johnsalmon, @moraesmark, @pbelevich regarding the paper. It would be great if there were a tool, possibly with Python bindings, which, for a tree-like structure of sampler objects, can generate pseudo-random seeds such that all samplers end up mutually independent pseudo-random, provided every node in the tree calls node.set_random_seed(children, node.random_seed) once its own seed has been set by its parent.

(the scope of this package is in-principle all of ML pipelines)

What is unfortunate is that we do not know the size of the tree in advance, so a node knows neither the number of its children, nor can it communicate with its parent. Otherwise, the solution would be "fairly trivial" by doing the following at the root:

1. compute the number of nodes,
2. run a PRNG sequence generator,
3. distribute the seeds, for any fixed enumeration, across the tree.
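The "spawn key" approach used by numpy's `SeedSequence.spawn` sidesteps the need for global knowledge: each node derives child seeds from the root seed plus its own path in the tree. A stdlib-only sketch of the same idea (`node_seed` is a hypothetical helper, not skbase API):

```python
import hashlib
import random

def node_seed(root_seed, path):
    """Derive a seed for a tree node from the root seed and the node's
    path, given as a tuple of child indices from the root.

    Distinct paths give distinct, pseudo-independent seeds via hashing,
    and no node needs to know the size of the tree or its siblings."""
    key = repr((root_seed, tuple(path))).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

# each node can seed itself lazily, knowing only its path
rng_root = random.Random(node_seed(7, ()))
rng_child0 = random.Random(node_seed(7, (0,)))
rng_grandchild = random.Random(node_seed(7, (0, 1)))
```

The design choice here is that a node's seed depends only on `(root_seed, path)`, so children can be spawned at any time, in any order, without coordination.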

fkiraly commented 8 months ago

Getting a bit deeper in the woods, one option would be to convolve each call to a dependent random seed with:

That would ensure uniqueness, pseudo-randomness, and pseudo-independence, as long as no line of code contains more than one call, and no two files from which the dependent seed generator is called are identical (e.g., something silly like near-empty `__init__.py` files).

For reference and potential use in that, here is a snippet (adapted from Stack Overflow) that returns the line a function was called from:

```python
from inspect import currentframe

def get_linenumber():
    """Return the line number of the call site, via the caller's frame."""
    cf = currentframe()
    return cf.f_back.f_lineno
```
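Combining that frame inspection with a hash of the call site gives a sketch of the "convolve with the call site" idea above (`call_site_seed` is a hypothetical helper, not skbase API):

```python
import hashlib
from inspect import currentframe

def call_site_seed(base_seed):
    """Derive a seed that also depends on the caller's file and line.

    Distinct call sites give distinct seeds; repeated calls from the
    same line give the same seed (hence the one-call-per-line caveat)."""
    frame = currentframe().f_back
    key = f"{base_seed}:{frame.f_code.co_filename}:{frame.f_lineno}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
```

Two calls on different lines then yield different seeds even with the same `base_seed`, while re-running the same program reproduces the same seeds.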