Closed fsvbach closed 2 years ago
Thanks for reporting the issue. I think this is definitely something that should work, although technically, it's currently unsupported. Perhaps the latest release broke this behaviour.
Hi Pavlin,
Thanks, with the newest version 0.6.0 the above example works now. What do you think caused the behaviour?
UPDATE: The bug is not fixed. I just forgot the comment out the line "df = df.to_numpy()"...
Oh, so, in the newest release, this works? And it didn't work in earlier versions? If that's the case, we can close this issue, since we (inadvertently) fixed this. I have no idea what caused this issue, but then again, I don't think I've ever used openTSNE with pandas directly. It would be a good idea to add a test for this actually.
@fsvbach Can you maybe post the exact error you received before? You should see a specific line that used to trigger the error.
This is the error message I get:
Traceback (most recent call last): File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/fsvbach/ML MSc/MISC/#Issue 182/minimalexample.py", line 32, in
tsne.fit(df) File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/tsne.py", line 1219, in fit embedding = self.prepare_initial(X, affinities, initialization)
File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/tsne.py", line 1295, in prepare_initial affinities = PerplexityBasedNN(
File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/affinity.py", line 187, in init self.neighbors, self.distances = self.knn_index.build()
File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/openTSNE/nearest_neighbors.py", line 279, in build self.index.add_item(i, data[i])
File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3024, in getitem indexer = self.columns.get_loc(key)
File "/home/fsvbach/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc raise KeyError(key) from err
KeyError: 0
Wait, so this works now? I don't think this should be working. I don't remember touching that particular part of the code. It makes sense that it fails, since indexing a pandas dataframe like that will select the column with name 0. A fix would also be fairly trivial -- just use df.values
instead of df
.
Well, it is not fixed, as I updated in the post above (I had forgotten to comment out df = df.to_numpy() in which case it always had been working).
With your proposal df.values() it works as well. But why does it sometimes also work (for small dataframes) without df.values() or df.to_numpy()?
Anyways, there is no a bug (and wasn't) as long as the column names are integers.
But why does it sometimes also work (for small dataframes) without df.values() or df.to_numpy()?
Because for small enough datasets, we use scikit-learn to compute nearest neighbors. Apparently, scikit-learn handles pandas dataframes without problem. For larger data sets, we use Annoy, which doesn't handle pandas.
Anyways, there is no a bug (and wasn't) as long as the column names are integers.
Actually, I think this should work incorrectly then. It doesn't crash, but it might be a silent bug. Because if we index a dataframe by columns, that's different from indexing a numpy array by rows. Could you check that the results are actually the same as when calling it with df.values
?
@pavlin-policar Based on the discussion above, this is not a bug. Rather, it's a feature request to add Pandas support to openTSNE. I guess this could be done along the lines of
if isinstance(X, pd.DataFrame):
X = X.values
however this will make openTSNE depend on Pandas, which I am not sure you want?
Can we use a hack like
if not isinstance(X, np.ndarray):
X = X.values
instead? Not sure what is the best practice.
Yes, I don't want to drag pandas into the requirements. After letting this sit for a long time, I don't see a clear solution here. The hack seems brittle, and would fail with a weird error message if I were to pass in a list, for instance. Perhaps the best course of action here would be to add a type check throughout the code to make sure the input is always a type we know how to work with, i.e., a numpy array or a scipy sparse matrix.
But adding type checks might be tedious, so I might even be okay leaving things as they are. It's pretty clear from the docstrings and documentation what types of objects we expect, so I think leaving it up to the user to pass in the correct type may be completely fine here as well. I don't feel strongly about this either way.
I agree -- I am fine with closing this issue.
I'm going to close this. I don't think this is an issue that pops up all that often, and since the end-users are, presumably, at least somewhat familiar with Python, I think the docstrings should provide sufficient information on the usage. If this ends up being a problem, we'll add type-checks everywhere.
Hi Pavlin,
I worked with openTSNE using pandas DataFrames as input for a while and it worked fine. Suddenly I got a KeyError. It seems that for larger DataFrames openTSNE brekas down (other than sklearn). This minimal example will reproduce the error:
Do you have an idea what could cause this, and whether it's worth while to extend openTSNE for pandas DataFrames?