lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

`Overflow encountered in true_divide` error when using Aligned UMAP #635

Open bianchi-dy opened 3 years ago

bianchi-dy commented 3 years ago

Hi! I'm a relatively new UMAP user, using Aligned UMAP to visualize the results of K-means clustering on a corpus of text documents across time.

As each time window shares some documents with the succeeding time window, I generate a dictionary of relations between slices, and also compute the pairwise distances between documents, using the following process:

def get_distance(similarity):
    slice_dist = 1 - similarity # similarity -> numpy array of TFIDF scores
    slice_dist[slice_dist <= 0] = 0
    return slice_dist

def get_relation(from_df, to_df):
    slice1_ids = from_df['ids'].reset_index().drop(['received'], axis=1)
    slice2_ids = to_df['ids'].reset_index().drop(['received'], axis=1)

    shared_ids = list(set(slice2_ids['id'].tolist()) & set(slice1_ids['id'].tolist())) 
    ind1 = slice1_ids[slice1_ids['id'].isin(shared_ids)]
    ind2 = slice2_ids[slice2_ids['id'].isin(shared_ids)]

    relation = {}  # maps row index in from_df to row index in to_df
    index1 = list(ind1.index)
    index2 = list(ind2.index)

    for i, item in enumerate(index1):
        relation[item] = index2[i]

    return relation

relations = []

for j, mat in slices.items():
    %time mat['distance'] = get_distance(mat['similarity'])

    if j > sliceKeys[0]:  # sliceKeys: the ordered keys of `slices` (defined elsewhere)
        prev_mat = slices[j-1]
        %time relations.append(get_relation(prev_mat, mat))

distances = [] # Each time slice's distance is added to an array so that I have an array of distances
for j, mat in slices.items():
    distances.append(mat['distance'])
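(For illustration, the relation dictionaries end up mapping row indices in one slice to row indices in the next. A toy example with plain id lists standing in for my dataframes:)

```python
# Toy stand-in for the 'id' columns of two consecutive time slices
slice1_ids = ["a", "b", "c", "d"]
slice2_ids = ["b", "d", "e"]

# Same index-mapping logic as get_relation above, minus the pandas
shared = set(slice1_ids) & set(slice2_ids)
relation = {slice1_ids.index(s): slice2_ids.index(s) for s in shared}

# "b" is row 1 in slice 1 and row 0 in slice 2; "d" is row 3 and row 1,
# so relation == {1: 0, 3: 1}
```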

My Aligned UMAP settings are as follows:

%%time
aligned_mapper = umap.AlignedUMAP(
    n_neighbors=5,
    min_dist=0.05,
).fit(distances, relations=relations)

My distances array looks like this: [screenshot of the distances array]

Previously this approach gave me no issues. However, running it on new data I've been getting the error below over and over.

/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/spectral.py:256: UserWarning: WARNING: spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  "WARNING: spectral initialisation failed! The eigenvector solver\n"
/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py:905: RuntimeWarning: overflow encountered in true_divide
  result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]

and the following traceback, which tells me I'm dividing by zero somewhere I'm not supposed to be?

--------------------
LinAlgError Traceback (most recent call last)
<timed exec> in <module>

~/opt/anaconda3/lib/python3.7/site-packages/umap/aligned_umap.py in fit(self, X, y, **fit_params)
    357                     embeddings[-1],
    358                     next_embedding,
--> 359                     np.vstack([left_anchors, right_anchors]),
    360                 )
    361             )

~/opt/anaconda3/lib/python3.7/site-packages/numba/np/linalg.py in _check_finite_matrix()
    751         if not np.isfinite(v.item()):
    752             raise np.linalg.LinAlgError(
--> 753                 "Array must not contain infs or NaNs.")
    754 
    755 

LinAlgError: Array must not contain infs or NaNs.

Any ideas as to what might be causing this error or how to fix it? My suspicion is that it has to do with the distances, but I'm not sure whether I need some sort of normalization or pre-processing beyond turning the TF-IDF similarity scores into distances. Unfortunately this error came up the night before a deadline I was intending to use Aligned UMAP for, so it'd be great if anyone could point me in the right direction, even a hacky one.

lmcinnes commented 3 years ago

The easiest hacky way would be to use init="random". It won't be great, but it will get you something in the interim. I'll have to think a little harder about what could be going astray and get back to you for a better solution.

bianchi-dy commented 3 years ago

Update: I've tried init="random" and metric="cosine" (since I'm actually using cosine distances) so far – still getting the same error.

lmcinnes commented 3 years ago

I'm afraid I can't help too much more without being able to reproduce the problem at this end. One catch might be the disconnection distance. You could explicitly set that to something large when working with cosine distance. disconnection_distance=3.0 ought to do the job if that is what is causing the problem.
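(For context: cosine distance is bounded above by 2, attained only for exactly opposite vectors, so a cutoff of 3.0 can never disconnect anything. A quick sanity check of that bound — this is not the internal UMAP code:)

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity; lies in [0, 2] for nonzero real vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
print(cosine_distance(u, u))    # identical vectors -> 0.0
print(cosine_distance(u, -u))   # opposite vectors  -> 2.0
# Both are below 3.0, so disconnection_distance=3.0 prunes no edges.
```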

gclen commented 3 years ago

I don't have a solution, but here is a small example that reproduces the problem for me:

import umap
import umap.utils as utils
import umap.aligned_umap

import sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

# Identity relation: row i in the first slice maps to row i in the second
constant_dict = {i: i for i in range(word_doc_matrix.shape[0])}
constant_relations = [constant_dict for i in range(1)]

neighbors_mapper = umap.AlignedUMAP(
    n_components=2,
    metric='hellinger',
    alignment_window_size=2,
    alignment_regularisation=1e-3,
    random_state=42,
    init='random',
).fit(
    [word_doc_matrix for i in range(2)], relations=constant_relations
)

lmcinnes commented 3 years ago

Thanks for the reproducer. I'll try to look into this when I get a little time.

GregDemand commented 2 years ago

I've fixed this issue with pull request #875. The problem was in umap_.py line 919:

result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]

where the guard (n_samples > 0) didn't protect the calculation: n_samples is np.float32, so dividing n_epochs by a tiny positive value overflows the float32 range and produces inf. The easiest fix was casting n_samples from np.float32 to np.float64 to match the type of result.

result[n_samples > 0] = float(n_epochs) / np.float64(n_samples[n_samples > 0])

This could alternatively have been fixed by refining the guard to something like:

result[n_samples/n_epochs > 0] = float(n_epochs) / n_samples[n_samples/n_epochs > 0]

but that solution looks worse.
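To see the overflow concretely (a minimal sketch, using a made-up tiny n_samples value rather than anything from the real code path):

```python
import numpy as np

n_epochs = 200

# A positive but tiny per-edge sample count, stored as float32: the guard
# `n_samples > 0` passes, yet the division still overflows float32.
n_samples = np.array([1e-37, 2.0], dtype=np.float32)

with np.errstate(over="ignore"):           # silence the RuntimeWarning
    bad = float(n_epochs) / n_samples      # result stays float32 -> inf

good = float(n_epochs) / np.float64(n_samples)   # float64 -> finite, ~2e39

print(np.isinf(bad[0]), np.isfinite(good[0]))    # True True
```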