pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Cannot pass random_state to PerplexityBasedNN when using Annoy #189

Closed jnboehm closed 2 years ago

jnboehm commented 3 years ago

Hi Pavlin,

this is quite a minuscule bug, but I noticed that PerplexityBasedNN fails when you pass it a numpy RandomState instance, because it forwards that instance directly to the AnnoyIndex(...).set_seed(seed) call. Since the documentation says it accepts both an integer and a numpy random state, I guess this is a (tiny) bug.

Expected behaviour

It sets a seed for the internal random state of annoy.

Actual behaviour

It crashes with a TypeError:

  File "/home/jnb/dev/openTSNE/openTSNE/nearest_neighbors.py", line 276, in build
    self.index.set_seed(self.random_state)
TypeError: an integer is required (got type numpy.random.mtrand.RandomState)

Steps to reproduce the behaviour

import numpy as np
from openTSNE import PerplexityBasedNN

random_state = np.random.RandomState(333)
data = random_state.uniform(size=(10000, 10))
PerplexityBasedNN(data, random_state=random_state)

Fix

In nearest_neighbors.py, line 275 can be changed from self.index.set_seed(self.random_state) to

        if isinstance(self.random_state, int):
            self.index.set_seed(self.random_state)
        else: # has to be a numpy RandomState
            self.index.set_seed(self.random_state.randint(-(2 ** 31), 2 ** 31))

Let me know if it should come as a pull request or if you'll just incorporate it like this. Cheers

zoj613 commented 3 years ago

I'm also getting this issue while calling fit_transform from the opentsne.sklearn submodule. Any thoughts on this @pavlin-policar ?

pavlin-policar commented 2 years ago

Hi, thanks for reporting this. This is an issue with hnswlib and annoy, which do not support being called with a RandomState instance. I've implemented a workaround, so this will be fixed in the next release.

zoj613 commented 2 years ago

@pavlin-policar I think it would be great if the new commit would support the Generator interface as well since it is not very different from RandomState.

pavlin-policar commented 2 years ago

Hmm, could you please link to this Generator? I've never heard of it before. Is it standard in Python?

zoj613 commented 2 years ago

> Hmm, could you please link to this Generator? I've never heard of it before. Is it standard in Python?

Yes, this is the new NumPy random generator interface that replaces RandomState. It was introduced in numpy v1.17 and is the interface to be used going forward. Development of the RandomState class has stopped; it now exists only for backwards compatibility.

See: https://numpy.org/neps/nep-0019-rng-policy.html https://numpy.org/doc/stable/reference/random/generator.html
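
For illustration, here is a minimal sketch of the two interfaces side by side (the names and seed values are arbitrary):

import numpy as np

legacy_rng = np.random.RandomState(42)  # legacy interface, kept for backwards compatibility
new_rng = np.random.default_rng(42)     # recommended Generator interface (numpy >= 1.17)

legacy_rng.randint(10)                  # legacy method for a random integer in [0, 10)
new_rng.integers(10)                    # Generator equivalent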

zoj613 commented 2 years ago

Also see this recent discussion on the scikit-learn issue tracker: https://github.com/scikit-learn/scikit-learn/issues/20669

pavlin-policar commented 2 years ago

Hmm, do you know if there's any way to extract the seed from the Generator class? I don't see one.

RandomState also doesn't have such an option, but I managed to come up with a hacky solution here.

We can always just generate a random int and use that for the seed in Annoy and hnswlib, but that could cause confusion.
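
For reference, a minimal sketch of that fallback, which simply draws a 32-bit integer from whichever RNG object was passed (the helper name is made up):

import numpy as np

def seed_from_rng(random_state):
    # hypothetical helper: derive an integer seed for Annoy / hnswlib
    if isinstance(random_state, np.random.Generator):
        return int(random_state.integers(2 ** 31))
    # otherwise assume a legacy RandomState
    return int(random_state.randint(2 ** 31))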

pavlin-policar commented 2 years ago

I've closed this issue now because the original problem is fixed. I will add support for Generator in a future PR if I think it makes sense. It seems as though RandomState will eventually be deprecated, so we'll have to do this at some point.

But taking a more big-picture view, this isn't really a problem with openTSNE. All of this is to accommodate the fact that neither Annoy nor hnswlib play nice with numpy's RNG classes. openTSNE now has a number of workarounds that "fix" incorrect behaviour in our dependencies. I do not like this at all, since it means that every time these dependencies change their behaviour, we have to adapt the workaround. I would much rather we open this issue with Annoy and hnswlib and encourage them to make their libraries play nice with the Python ecosystem.

zoj613 commented 2 years ago

> Hmm, do you know if there's any way to extract the seed from the Generator class? I don't see one.
>
> RandomState also doesn't have such an option, but I managed to come up with a hacky solution here.
>
> We can always just generate a random int and use that for the seed in Annoy and hnswlib, but that could cause confusion.

The method you used is common for manipulating state in RandomState. With Generator you can use something like the following example:

>>> import numpy
>>> g = numpy.random.default_rng(12345)
>>> seed = g.bit_generator.state
>>> seed
{'bit_generator': 'PCG64', 'state': {'state': 33261208707367790463622745601869196757, 'inc': 268209174141567072605526753992732310247}, 'has_uint32': 0, 'uinteger': 0}
>>> new_g = numpy.random.default_rng()
>>> new_g.__setstate__(seed)
>>> new_g.bit_generator.state
{'bit_generator': 'PCG64', 'state': {'state': 33261208707367790463622745601869196757, 'inc': 268209174141567072605526753992732310247}, 'has_uint32': 0, 'uinteger': 0}
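
Equivalently, assuming a reasonably recent numpy, the captured state can be assigned back directly to the bit generator:

>>> new_g.bit_generator.state = seed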

I do agree that the real issue should be fixed in the Annoy package, but it doesn't look like the package is being actively developed anymore. I will poke around the repo and see if someone has brought up the issue before.

pavlin-policar commented 2 years ago

This isn't really what I'm looking for. I would want to be able to extract the original seed from the random state -- in this case 12345. The reason is that if I call openTSNE with random_state=12345, I would expect to get the same results as calling Annoy on its own with 12345. That isn't what happens currently, but it's what I would expect.

Come to think of it now, my workaround probably doesn't really work, because invoking the RNG will change its state, e.g.

>>> random_state = np.random.RandomState(42)
>>> random_state.get_state()[1][0]
42
>>> random_state.randint(10)
6
>>> random_state.get_state()[1][0]
723970371

In which case I see no solution to this problem, except generating a random int and seeding annoy with that.

zoj613 commented 2 years ago

This is what you'd be looking for with Generator:

>>> g.bit_generator._seed_seq.entropy
12345

However, it is bad practice to have multiple RNGs initialized with the same state. It would make more sense to pass Annoy a random value drawn from the generator that was initialized with 12345.

> This isn't really what I'm looking for. I would want to be able to extract the original seed from the random state -- in this case 12345. The reason is that if I call openTSNE with random_state=12345, I would expect to get the same results as calling Annoy on its own with 12345. That isn't what happens currently, but it's what I would expect.
>
> Come to think of it now, my workaround probably doesn't really work, because invoking the RNG will change its state, e.g.
>
> >>> random_state = np.random.RandomState(42)
> >>> random_state.get_state()[1][0]
> 42
> >>> random_state.randint(10)
> 6
> >>> random_state.get_state()[1][0]
> 723970371
>
> In which case I see no solution to this problem, except generating a random int and seeding annoy with that.

get_state only returns the current state of the bit generator, not the entropy used to initialize it. To get the entropy you could use the same approach outlined for Generator, but with _bit_generator instead of bit_generator:

g._bit_generator._seed_seq.entropy

pavlin-policar commented 2 years ago

Yes, generating a random integer to seed annoy seems like the way to go then. The problem is now shifted to backwards compatibility. We rely on check_random_state from scikit-learn to take care of the RNG and always get a RandomState instance. I don't think I want to add extra code to support Generator until scikit-learn supports it as well. The big issue is that this was added in numpy 1.17, but I build my wheels on 1.16. Obviously, I can bump up the minimum numpy version, but I'd like to avoid this as much as possible.

I think I'll keep this as is for the time being, or at least until scikit-learn incorporates this as well. It seems like added complexity for almost no benefit for the time being.
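
For context, at the time of this discussion scikit-learn's check_random_state only normalizes None, ints, and RandomState instances, roughly as follows:

from sklearn.utils import check_random_state
import numpy as np

check_random_state(42)                          # a new RandomState seeded with 42
check_random_state(np.random.RandomState(42))   # returned unchanged
check_random_state(None)                        # numpy's global RandomState
# Passing a Generator raises a ValueError in scikit-learn versions without Generator support.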

zoj613 commented 2 years ago

> Yes, generating a random integer to seed annoy seems like the way to go then. The problem is now shifted to backwards compatibility. We rely on check_random_state from scikit-learn to take care of the RNG and always get a RandomState instance. I don't think I want to add extra code to support Generator until scikit-learn supports it as well. The big issue is that this was added in numpy 1.17, but I build my wheels on 1.16. Obviously, I can bump up the minimum numpy version, but I'd like to avoid this as much as possible.
>
> I think I'll keep this as is for the time being, or at least until scikit-learn incorporates this as well. It seems like added complexity for almost no benefit for the time being.

That is understandable. I would also like to bring to your attention NEP 29, which recommends Python and NumPy version support as a community policy standard. It is intended to help reduce compatibility issues for downstream packages that depend on numpy and other core packages in the Python ecosystem.

I did implement a workaround for a small project of mine that depends on scikit-learn. Here is a simple implementation that lets users pass in either generator interface without worrying about support. Essentially, when a user passes in an RNG, it gets converted to a Generator instance internally through the check_random_state function. When that instance needs to be passed to scikit-learn classes, I convert it to a RandomState with the safe_random_state function. This approach keeps the same bit generator the user passed in and does not need hacks to make things work. See: https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_common.py#L4-L19

Examples: https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_gamus.py#L142 https://github.com/zoj613/pyloras/blob/9f27e7e25eb62dd90ca9883e3ee60f9c2ce348dd/pyloras/_gamus.py#L136
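
A minimal sketch of that approach (modeled loosely on the linked helpers rather than copied from them; assumes numpy >= 1.17):

import numpy as np

def check_random_state(random_state):
    # normalize None / int / Generator / RandomState into a Generator
    if isinstance(random_state, np.random.Generator):
        return random_state
    if isinstance(random_state, np.random.RandomState):
        # reuse the RandomState's underlying bit generator
        return np.random.default_rng(random_state._bit_generator)
    return np.random.default_rng(random_state)

def safe_random_state(rng):
    # wrap the Generator's bit generator in a RandomState for scikit-learn APIs
    return np.random.RandomState(rng.bit_generator)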

pavlin-policar commented 2 years ago

Thanks for the link to the numpy deprecation policy -- I wasn't aware of it, and I'll try to keep openTSNE up to date with it. By the looks of things, since officially only numpy v1.18 and newer are supported, the Generator interface you're proposing should be completely fine to use, since it was added in 1.17.

Your workaround seems great, and I would be happy to review and incorporate this here if you open up a PR. It seems to me that since numpy is eventually going to deprecate RandomState, scikit-learn will probably also go through a similar cycle, where they will have to support both RandomState and Generator at the same time. I mention this only because I want openTSNE to behave as similarly as possible to scikit-learn, since that's what everybody's used to.