lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

UMAP segmentation faults with correlation and cosine metrics in certain datasets #956

Open edmoman opened 1 year ago

edmoman commented 1 year ago

UMAP version 0.5.3, numba version 0.56.4

UMAP segfaults with the correlation and cosine metrics but not with other metrics (euclidean, manhattan and canberra tested). The issue only occurs with very specific datasets.

Here is my code:

import umap

# df_clean is the dataset after the problematic column has been removed
clusterable_embedding_test = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=9,
    random_state=31416,
    #metric='manhattan',
    #metric='canberra',
    #metric='cosine',
    metric='correlation',
).fit_transform(df_clean)

I have tested in native Ubuntu 22.04, Ubuntu 22.04 under WSL2 and Windows. Same issue.

The issue only manifests itself when I remove a specific column from the dataset. If the column is present the code runs fine.

It does not matter how the column is removed (whether dropped from the dataframe, not loaded or removed beforehand from the CSV file).

Removing or adding other columns is OK, regardless of how many.
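One thing that may be worth checking, since only cosine and correlation are affected: both metrics normalise each row. Cosine distance divides by the product of the two row norms, and correlation distance does the same after subtracting each row's mean, so an all-zero row (cosine) or a constant row (correlation) makes the distance undefined. Whether removing the column produces such rows here, and whether that is what triggers the segfault, is not confirmed in this thread. The sketch below is only a diagnostic; the helper name degenerate_rows and the tolerance are hypothetical, not part of UMAP.

```python
import math


def degenerate_rows(rows, tol=1e-12):
    """Return (zero_norm, constant): indices of rows whose norm is ~0
    and indices of rows whose values are all (nearly) identical.

    Hypothetical helper for diagnosis only; not part of umap or numba.
    """
    zero_norm, constant = [], []
    for i, row in enumerate(rows):
        values = [float(v) for v in row]
        # Norm of the raw row: the denominator term used by cosine distance.
        norm = math.sqrt(sum(v * v for v in values))
        # Norm of the mean-centred row: the denominator term used by
        # correlation distance.
        mean = sum(values) / len(values)
        spread = math.sqrt(sum((v - mean) ** 2 for v in values))
        if norm <= tol:
            zero_norm.append(i)
        if spread <= tol:
            constant.append(i)
    return zero_norm, constant


sample = [
    [1.0, 2.0, 3.0],  # fine for both metrics
    [0.0, 0.0, 0.0],  # zero norm: cosine undefined (also constant)
    [5.0, 5.0, 5.0],  # constant: correlation undefined
]
print(degenerate_rows(sample))  # ([1], [1, 2])
```

Assuming df_clean is a pandas DataFrame, the same check could be run on it with degenerate_rows(df_clean.to_numpy().tolist()), once with the column present and once with it removed, to see whether the two cases differ.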

Setting NUMBA_DISABLE_JIT=1 prevents the segfault but results in extremely slow execution.

NUMBA_DISABLE_INTEL_SVML=1 has no effect. Same for a large number of other numba-related variables I have tried.
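For anyone trying to reproduce this, a minimal sketch of how the environment could be prepared from inside Python rather than from the shell. It assumes the variable has to be set before numba (and therefore umap) is first imported, and it uses the standard-library faulthandler module so that a segfault leaves a Python-level traceback in a log file. The function name enable_crash_diagnostics and the log file name are hypothetical.

```python
import faulthandler
import os


def enable_crash_diagnostics(log_path="umap_crash.log", disable_jit=False):
    """Prepare the process for debugging a native crash.

    Hypothetical helper: call it BEFORE importing numba or umap, on the
    assumption that NUMBA_DISABLE_JIT is read when numba is imported.
    """
    if disable_jit:
        # Same effect as exporting NUMBA_DISABLE_JIT=1 in the shell.
        os.environ["NUMBA_DISABLE_JIT"] = "1"
    # Keep the file object alive: faulthandler writes to its descriptor
    # when a fatal signal such as SIGSEGV arrives.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Intended usage (left commented out because it needs umap and the data):
# crash_log = enable_crash_diagnostics()
# import umap
# umap.UMAP(metric="correlation").fit_transform(df_clean)
```

With the JIT left enabled, the traceback written to the log only shows which Python call was active when the crash happened, not the failing native frame, but that can already narrow down the stage involved.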

edmoman commented 1 year ago

The Julia implementation of UMAP seems to work:

using CSV
using DataFrames
using Distances
using UMAP

df = CSV.read("credit_risk_scores_equation_label_encoded.csv", DataFrame;)

df = select!(df, Not(:"date_opening"))

df_clean = Matrix{Float64}(df)

embedding = umap(df_clean, 9; n_neighbors=30, min_dist=0.00000000001, metric=CosineDist())
embedding_3d = umap(df_clean, 3; n_neighbors=30, min_dist=0.00000000001, metric=CosineDist())

print(embedding)
print(embedding_3d)

The offending operation in Python (dropping the column) does not prevent Julia's UMAP from running (df = select!(df, Not(:"date_opening"))).

One difference, however: min_dist can be zero in Python, but not in Julia.

Minhvt34 commented 1 year ago

Have you tried processing your dataset with other dimension reduction methods?

I have also run into a "Segmentation fault" when using UMAP validation; it means that the program crashed. So, if your dataset (with or without the specific column) still works with other methods, it could be that UMAP hits a numerical problem when computing those particular metrics. Otherwise, I think your dataset depends sensitively on that column.

edmoman commented 1 year ago

Thanks. Yes, in addition to UMAP in Python (which crashes, but only with the cosine and correlation metrics), I have tried the Julia implementation of UMAP (which works fine) and also t-SNE in Python (which also works fine).

edmoman commented 1 year ago

Please note that the error only happens when I remove one specific column from the dataset. If I add or remove other columns, it is OK.

In any case, we have now decided to keep that column and I have also switched to the canberra metric, so everything is working fine.