pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License
1.48k stars 165 forks source link

pandas DataFrames KeyError #182

Closed fsvbach closed 2 years ago

fsvbach commented 3 years ago

Hi Pavlin,

I worked with openTSNE using pandas DataFrames as input for a while and it worked fine. Suddenly I got a KeyError. It seems that for larger DataFrames openTSNE brekas down (other than sklearn). This minimal example will reproduce the error:

from openTSNE import TSNE
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(2000, 4)), 
                  columns=list('ABCD'))

tsne = TSNE()

## adding this line solves the problem entirely
# df  = df.to_numpy()

## this works
tsne.fit(df[:500])

## this doesn't work
tsne.fit(df)

Do you have an idea what could cause this, and whether it's worth while to extend openTSNE for pandas DataFrames?

pavlin-policar commented 3 years ago

Thanks for reporting the issue. I think this is definitely something that should work, although technically, it's currently unsupported. Perhaps the latest release broke this behaviour.

fsvbach commented 3 years ago

Hi Pavlin,

Thanks, with the newest version 0.6.0 the above example works now. What do you think caused the behaviour?

UPDATE: The bug is not fixed. I just forgot the comment out the line "df = df.to_numpy()"...

pavlin-policar commented 3 years ago

Oh, so, in the newest release, this works? And it didn't work in earlier versions? If that's the case, we can close this issue, since we (inadvertently) fixed this. I have no idea what caused this issue, but then again, I don't think I've ever used openTSNE with pandas directly. It would be a good idea to add a test for this actually.

dkobak commented 3 years ago

@fsvbach Can you maybe post the exact error you received before? You should see a specific line that used to trigger the error.

fsvbach commented 3 years ago

This is the error message I get:

Traceback (most recent call last): File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc return self._engine.get_loc(casted_key)

File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc

File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc

File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item

File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "/home/fsvbach/ML MSc/MISC/#Issue 182/minimalexample.py", line 32, in tsne.fit(df)

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/tsne.py", line 1219, in fit embedding = self.prepare_initial(X, affinities, initialization)

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/tsne.py", line 1295, in prepare_initial affinities = PerplexityBasedNN(

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/affinity.py", line 187, in init self.neighbors, self.distances = self.knn_index.build()

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/nearest_neighbors.py", line 279, in build self.index.add_item(i, data[i])

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key)

File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc raise KeyError(key) from err

KeyError: 0

pavlin-policar commented 3 years ago

Wait, so this works now? I don't think this should be working. I don't remember touching that particular part of the code. It makes sense that it fails, since indexing a pandas dataframe like that will select the column with name 0. A fix would also be fairly trivial -- just use df.values instead of df.

fsvbach commented 3 years ago

Well, it is not fixed, as I updated in the post above (I had forgotten to comment out df = df.to_numpy() in which case it always had been working).

With your proposal df.values() it works as well. But why does it sometimes also work (for small dataframes) without df.values() or df.to_numpy()?

Anyways, there is no a bug (and wasn't) as long as the column names are integers.

pavlin-policar commented 3 years ago

But why does it sometimes also work (for small dataframes) without df.values() or df.to_numpy()?

Because for small enough datasets, we use scikit-learn to compute nearest neighbors. Apparently, scikit-learn handles pandas dataframes without problem. For larger data sets, we use Annoy, which doesn't handle pandas.

Anyways, there is no a bug (and wasn't) as long as the column names are integers.

Actually, I think this should work incorrectly then. It doesn't crash, but it might be a silent bug. Because if we index a dataframe by columns, that's different from indexing a numpy array by rows. Could you check that the results are actually the same as when calling it with df.values?

dkobak commented 2 years ago

@pavlin-policar Based on the discussion above, this is not a bug. Rather, it's a feature request to add Pandas support to openTSNE. I guess this could be done along the lines of

if isinstance(X, pd.DataFrame):
   X = X.values

however this will make openTSNE depend on Pandas, which I am not sure you want?

Can we use a hack like

if not isinstance(X, np.ndarray):
   X = X.values

instead? Not sure what is the best practice.

pavlin-policar commented 2 years ago

Yes, I don't want to drag pandas into the requirements. After letting this sit for a long time, I don't see a clear solution here. The hack seems brittle, and would fail with a weird error message if I were to pass in a list, for instance. Perhaps the best course of action here would be to add a type check throughout the code to make sure the input is always a type we know how to work with, i.e., a numpy array or a scipy sparse matrix.

But adding type checks might be tedious, so I might even be okay leaving things as they are. It's pretty clear from the docstrings and documentation what types of objects we expect, so I think leaving it up to the user to pass in the correct type may be completely fine here as well. I don't feel strongly about this either way.

dkobak commented 2 years ago

I agree -- I am fine with closing this issue.

pavlin-policar commented 2 years ago

I'm going to close this. I don't think this is an issue that pops up all that often, and since the end-users are, presumably, at least somewhat familiar with Python, I think the docstrings should provide sufficient information on the usage. If this ends up being a problem, we'll add type-checks everywhere.