rcoconnell opened this issue 4 years ago
I suspect the issue is that the transform method is failing to find 15 nearest neighbors for those bad rows -- they are too dissimilar to anything else (due, in large part, to their extreme sparsity) for the approximate nearest neighbor search to find results. I'm not sure if there is an easy solution to this within the framework of the current codebase.
Yeah, I was worried that might be the case. Is there a way for it to fail more gracefully? In my case I don’t mind if a few rows just can’t be embedded; the problem is that transform doesn’t return anything at all if there are any bad rows.
Alternatively, is there a way to do my own separate knn check before passing the new data through to transform?
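For example, something like this is what I have in mind (just a sketch; it assumes pynndescent, which I believe umap-learn uses internally for the neighbor search, accepts the same sparse input and metric, and that neighbors it can't find come back as -1; `X_train`, `X_test`, and `umap_model` are placeholders):

```python
import numpy as np
from pynndescent import NNDescent

K = 15  # match the n_neighbors used by the UMAP model

# Build a neighbor index on the training vectors with the same metric as the embedding
index = NNDescent(X_train, metric="hellinger", n_neighbors=K)

# Query the test vectors; rows that can't fill all K neighbor slots are the "bad" ones
neighbor_indices, _ = index.query(X_test, k=K)
bad_rows = np.any(neighbor_indices == -1, axis=1)

# Only pass the rows that found a full neighborhood through to transform
test_embedding = umap_model.transform(X_test[~bad_rows])
```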
There might be some small modifications that you could make to just do a bad job for the bad rows. I'll see if I can come up with some minimal code changes and get back to you.
Here's one possible approach that may work for you:
At line 2129, insert something like:

```python
# Neighbor lookups that failed come back as -1; fill those slots with the
# maximum observed distance and a random training point.
bad_indices = (indices == -1)
dists[bad_indices] = np.max(dists)
indices[bad_indices] = np.random.randint(self._raw_data.shape[0], size=np.sum(bad_indices))
```
That may resolve the issue... assuming I have diagnosed it correctly.
Thanks! I wound up with the following:
```python
# Rows whose neighbor list has too few distinct entries didn't find a full set
# of neighbors; give them placeholder neighbors (max distance, random training points).
index_check = np.array([len(np.unique(x)) for x in indices])
bad_indices = (index_check < self._n_neighbors)
dists[bad_indices] = np.max(dists)
indices[bad_indices] = np.random.randint(self._raw_data.shape[0], size=(np.sum(bad_indices), self._n_neighbors))
```
That may be enough to close this issue, but now that it's running I've been confused by the results of applying transform. I started by generating a 2D embedding for my full data set. I did a train-test split, and plotted the embedding for both train and test. I also flagged a cluster (identified by HDBSCAN, here in red) of points to track. No surprises here -- the train-test split is random, so the embeddings and cluster look the same.
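In rough code, the baseline setup looked something like this (a sketch; `X_all` is the full TF-IDF matrix, and the HDBSCAN parameters and the cluster id are just placeholders):

```python
import umap
import hdbscan
from sklearn.model_selection import train_test_split

# Embed the full TF-IDF matrix, then split the embedding for plotting
mapper = umap.UMAP(n_components=2, metric="hellinger")
embedding_all = mapper.fit_transform(X_all)
emb_train, emb_test = train_test_split(embedding_all, test_size=0.2, random_state=42)

# Flag a cluster to track (found by HDBSCAN on the full embedding)
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding_all)
tracked = labels == 3  # an arbitrary cluster id, highlighted in red in the plots
```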
Next, I trained a new 2D embedding on the training set, and applied it (using transform) to the test set. When I plot the embeddings, the test set matches the broad outlines of the training set, but is missing most of the internal structure. If I track the same cluster from above, it's well-concentrated in the training set but blown apart in the test set. The green points are those with a bad embedding -- they represent <1% of the test set.

I wasn't sure if this was expected behavior or not. When computing the embeddings, I used the Hellinger metric -- is that a choice that transform will respect, or could it be using a different metric?
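For concreteness, the train/test pattern is just this (a sketch; `X_train` and `X_test` are the split TF-IDF matrices):

```python
import umap

# Fit only on the training TF-IDF matrix, using the Hellinger metric
mapper = umap.UMAP(n_components=2, metric="hellinger", n_neighbors=15)
emb_train = mapper.fit_transform(X_train)

# Project the held-out rows into the same space; my assumption is that
# transform reuses the metric the model was fitted with
emb_test = mapper.transform(X_test)
```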
First, thanks for putting such a great project together!
I'm having trouble with the transform method. I'm training on TF-IDF vectors from a large set of text data (>100k rows, 20k features), of type scipy.sparse.csr.csr_matrix. I'm then applying it to a new set of test data encoded in the same way. I seem to have some "bad" rows, and get an error if I try to transform the full test set or an individual "bad" row. That error is:
The only unusual thing I've found about the "bad" rows is that they have a relatively small number of non-zero entries. For example, I looked for all rows with <10 non-zero entries, and most (but not all!) of them produced the error above. If I take a random sample of rows, rather than hunting for bad ones, the failure rate is <1%.
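For reference, I'm counting non-zero entries per row like this (a sketch; `X_test` is the test-set csr_matrix):

```python
import numpy as np

# Number of stored (non-zero) entries in each row of the sparse test matrix
nnz_per_row = X_test.getnnz(axis=1)  # equivalently: np.diff(X_test.indptr)

sparse_rows = np.where(nnz_per_row < 10)[0]
print(f"{sparse_rows.size} rows have fewer than 10 non-zero entries")
```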
If I can provide anything else that would be useful, please let me know!