jnboehm closed this issue 2 years ago
I'm also getting this issue while calling `fit_transform` from the `opentsne.sklearn` submodule. Any thoughts on this, @pavlin-policar?
Hi, thanks for reporting this. This is an issue with hnswlib and annoy, which do not support being called with a `RandomState` instance. I've implemented a workaround, so this will be fixed in the next release.
@pavlin-policar I think it would be great if the new commit would support the `Generator` interface as well, since it is not very different from `RandomState`.
Hmm, could you please link to this `Generator`? I've never heard of it before. Is it standard in Python?
Yes, this is the new NumPy random generator interface that replaces `RandomState`. It was introduced in numpy v1.17 and is to be used going forward. Development of the `RandomState` class has stopped; it now exists only for backwards compatibility.
See: https://numpy.org/neps/nep-0019-rng-policy.html https://numpy.org/doc/stable/reference/random/generator.html
Also see this recent discussion on the scikit-learn issue tracker: https://github.com/scikit-learn/scikit-learn/issues/20669
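For reference, a minimal side-by-side comparison of the two interfaces (a sketch assuming numpy >= 1.17; method names differ between the two, e.g. `randint` vs `integers`):

```python
import numpy as np

# Legacy interface, kept only for backwards compatibility:
rs = np.random.RandomState(12345)
legacy_sample = rs.randint(0, 10, size=3)

# New interface, recommended since numpy 1.17:
rng = np.random.default_rng(12345)
sample = rng.integers(0, 10, size=3)
```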
Hmm, do you know if there's any way to extract the seed from the `Generator` class? I don't see one. `RandomState` also doesn't have such an option, but I managed to come up with a hacky solution here. We can always just generate a random int and use that for the seed in Annoy and hnswlib, but that could cause confusion.
I've closed this issue now, because the original issue is fixed. I will add support for `Generator` in a future PR if I think it makes sense. It seems as though `RandomState` will eventually be deprecated, so we'll have to do this eventually.
But in a more big-picture view, this isn't really a problem with openTSNE. All this is to accommodate the fact that neither Annoy nor hnswlib play nice with numpy's RNG classes. openTSNE now has a number of workarounds that "fix" incorrect behaviour from the libraries we depend on. I do not like this at all, since it means that every time these dependencies change their behaviour, we have to adapt the workaround. I would much rather open this issue with Annoy and hnswlib and encourage them to make their libraries play nice with the Python ecosystem.
The method you used is common for manipulating state in `RandomState`. In `Generator` you can use something like the following example:
>>> import numpy
>>> g = numpy.random.default_rng(12345)
>>> seed = g.bit_generator.state
>>> seed
{'bit_generator': 'PCG64', 'state': {'state': 33261208707367790463622745601869196757, 'inc': 268209174141567072605526753992732310247}, 'has_uint32': 0, 'uinteger': 0}
>>> new_g = numpy.random.default_rng()
>>> new_g.bit_generator.state = seed
>>> new_g.bit_generator.state
{'bit_generator': 'PCG64', 'state': {'state': 33261208707367790463622745601869196757, 'inc': 268209174141567072605526753992732310247}, 'has_uint32': 0, 'uinteger': 0}
I do agree that the real issue should be fixed in the Annoy package, but it doesn't look like the package is being actively developed anymore. I will poke around the repo and see if someone has brought up the issue before.
This isn't really what I'm looking for. I would want to be able to extract the original seed from the random state -- in this case 12345. The reason being that if I call openTSNE with random_seed=12345, I would expect this to give me the same results as calling Annoy on its own with 12345. This isn't really what happens, but it's what I would expect.
Come to think of this now, my workaround probably doesn't really work because invoking the RNG will change the state, e.g.
>>> import numpy as np
>>> random_state = np.random.RandomState(42)
>>> random_state.get_state()[1][0]
42
>>> random_state.randint(10)
6
>>> random_state.get_state()[1][0]
723970371
In which case I see no solution to this problem, except generating a random int and seeding annoy with that.
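The fallback described here can be sketched as follows (the `set_seed` call at the end is shown as a comment, since it requires an Annoy index object and Annoy may not be installed):

```python
import numpy as np

# Annoy and hnswlib only accept integer seeds, so derive a plain
# int from the RandomState rather than passing the instance itself.
random_state = np.random.RandomState(42)
seed = int(random_state.randint(np.iinfo(np.int32).max))
# index.set_seed(seed)  # e.g. AnnoyIndex(...).set_seed(seed)
```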
This is what you'd be looking for in `Generator`:
>>> g.bit_generator._seed_seq.entropy
12345
However, it is bad practice to have multiple RNGs initialized with the same state. It would make more sense to pass Annoy a random value drawn from the generator initialized with 12345.
`get_state` only gets the current state of the bit generator, not the entropy used to initialize it. To get the entropy you could use the same approach outlined for `Generator`, but with `_bit_generator` instead of `bit_generator`:
g._bit_generator._seed_seq.entropy
Yes, generating a random integer to seed annoy seems like the way to go then. The problem is now shifted to backwards compatibility. We rely on `check_random_state` from scikit-learn to take care of the RNG and always get a `RandomState` instance. I don't think I want to add extra code to support `Generator` until scikit-learn supports it as well. The big issue is that this was added in numpy 1.17, but I build my wheels on 1.16. Obviously, I can bump up the minimum numpy version, but I'd like to avoid this as much as possible.
I think I'll keep this as is for the time being, or at least until scikit-learn incorporates this as well. It seems like added complexity for almost no benefit for the time being.
That is understandable. I would like to also bring to your attention the "Recommend Python and NumPy version support as a community policy standard" proposal introduced in NEP 29. This is intended to help reduce compatibility issues for downstream packages that depend on numpy and other core packages in the Python ecosystem.
I did implement a workaround for a small project of mine that depends on scikit-learn. Here is a simple implementation that allows users to pass in both generator interfaces without worrying about support. Essentially, when a user passes in an RNG, it gets converted to a `Generator` instance internally through the `check_random_state` function. When that instance needs to be passed to scikit-learn classes, I convert it to a `RandomState` with the `safe_random_state` function. This approach keeps the same bit generator the user passed in and does not have to use hacks to make things work. See:
https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_common.py#L4-L19
Examples: https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_gamus.py#L142 https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_gamus.py#L136
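The approach described above can be sketched roughly as follows. This is an illustrative reimplementation under the names used in the linked code (`check_random_state`, `safe_random_state`), not the exact source; it relies on the fact that both `default_rng` and `RandomState` accept a `BitGenerator`:

```python
import numpy as np

def check_random_state(random_state):
    # Normalize user input (None, int, Generator, or RandomState)
    # to a numpy Generator instance.
    if random_state is None or isinstance(random_state, int):
        return np.random.default_rng(random_state)
    if isinstance(random_state, np.random.Generator):
        return random_state
    if isinstance(random_state, np.random.RandomState):
        # Reuse the underlying bit generator so the stream is preserved.
        return np.random.default_rng(random_state._bit_generator)
    raise ValueError(f"{random_state!r} cannot be used to seed a Generator")

def safe_random_state(generator):
    # Wrap the Generator's bit generator in a RandomState for
    # libraries (e.g. scikit-learn) that still expect RandomState.
    return np.random.RandomState(generator.bit_generator)
```

Because both wrappers share one bit generator, draws through either interface advance the same underlying stream, so no state is duplicated.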
Thanks for this link to the numpy deprecation cycle -- I wasn't aware of this, and I'll try to keep openTSNE up to date with it. By the looks of things, since officially only numpy v1.18 is supported, the `Generator` interface you're proposing should be completely fine to use, since it was added in 1.17.
Your workaround seems great, and I would be happy to review and incorporate this here if you open up a PR. It seems to me that since numpy is eventually going to deprecate `RandomState`, scikit-learn will probably also go through a similar cycle, where they will have to support both `RandomState` and `Generator` at the same time. I mention this only because I want openTSNE to behave as similarly as possible to scikit-learn, since that's what everybody's used to.
Hi Pavlin,
this is quite a minuscule bug, but I noticed that `PerplexityBasedNN` fails when you pass it a numpy `RandomState` instance, because it uses that value directly in the `AnnoyIndex(...).set_seed(seed)` call. Since the documentation says that it accepts both an integer and a numpy random state, I guess this is a (tiny) bug.
Expected behaviour
It sets a seed for the internal random state of annoy.
Actual behaviour
It crashes with a TypeError:
Steps to reproduce the behavior
Fix
The call in nearest_neighbors.py, line 275, can be changed from `self.index.set_seed(self.random_state)` to a call that passes an integer seed instead. Let me know if it should come as a pull request or if you'll just incorporate it like this. Cheers