rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
28 stars 11 forks source link

provide seed for reproducible creation of Rfree flags #169

Closed dennisbrookner closed 1 year ago

dennisbrookner commented 1 year ago

I just had the unfortunate experience of re-running a cell in a jupyter notebook and regenerating all of my R-free flags 😭. While that is of course totally my fault, I'm also sure I won't be the last to do this.

To me, the solution is to allow a user to seed the random number generator, and then get reproducible R-free flags.

e.g.

def add_rfree(dataset, fraction=0.05, ccp4_convention=False, inplace=False):
    ...
    test_set = np.random.random(len(dataset)) <= fraction

would become something like

def add_rfree(dataset, fraction=0.05, ccp4_convention=False, inplace=False, seed=None):
    ...
    rng = np.random.default_rng(seed) # this accepts None! So convenient!
    test_set = rng.random(len(dataset)) <= fraction

This has the side benefit of switching us over to the new version of numpy random number generators, which I'm just now learning about, but are a definite improvement.

JBGreisman commented 1 year ago

I agree this sounds like a good idea, and is consistent with how users would expect this sort of consistency to be achieved.

I think this would be a good "first issue" for someone to work on. In the meantime, I will note that setting the random seed for numpy prior to the call to rs.utils.add_rfree() should get you a reproducible result:

In [1]: import numpy as np

In [2]: import reciprocalspaceship as rs

In [3]: ds = rs.read_mtz("tests/data/fmodel/9LYZ.mtz")

In [4]: def test_noseed(ds):
    ...:     flags1 = rs.utils.add_rfree(ds, fraction=0.5)["R-free-flags"]
    ...:     flags2 = rs.utils.add_rfree(ds, fraction=0.5)["R-free-flags"]
    ...:     return np.array_equal(flags1, flags2)
    ...: 

In [5]: def test_seed(ds):
    ...:     np.random.seed(100)
    ...:     flags1 = rs.utils.add_rfree(ds, fraction=0.5)["R-free-flags"]
    ...:     np.random.seed(100)
    ...:     flags2 = rs.utils.add_rfree(ds, fraction=0.5)["R-free-flags"]
    ...:     return np.array_equal(flags1, flags2)
    ...: 

In [6]: test_noseed(ds)
Out[6]: False

In [7]: test_seed(ds)
Out[7]: True
dennisbrookner commented 1 year ago

I just want to note in case anyone finds this thread in the future that

np.random.seed(seed)
np.random.random()

and

rng = np.random.default_rng(seed)
rng.random()

are actually not equivalent, because they use different random number generation algorithms. I only mention this because if you used the np.random.seed workaround above, and then switched over to the new rs function that uses np.random.default_rng, your flags would change.

I'm quite certain that this will not actually apply to anyone, because rs will contain the new functionality quite soon (and already does on the rfree branch), but I figured I would mention it here just in case.