lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

UMAP validation trustworthiness_vector fails with 0xC00000FD on Windows and Segmentation fault on WSL. #963

Open Minhvt34 opened 1 year ago

Minhvt34 commented 1 year ago

I run into this problem when using UMAP to reduce my raw data from shape (250000x256) to (250000x8). I also tried reducing to 2 dimensions, but I still get the error when I calculate trustworthiness = validation.trustworthiness_vector(source=df_raw_data.to_numpy(), embedding=df_embedding.to_numpy(), max_k=30). I am wondering: is there any limit on the size of the dataset when applying UMAP validation?

ejyepezm commented 1 year ago

Yeah, I think so. After running some stress tests with random data, I found that the error occurred once the data exceeded shape (10000x12) on a standard computer. You may want to run this validation on a sample of your data (as in cross-validation) or increase your computational capacity.
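A minimal sketch of the sampling suggestion, using only NumPy. The helper name subsample_pair and the stand-in arrays are hypothetical, not part of umap. It draws the same random rows from the raw data and from the embedding so the two stay aligned. A score computed on the subsample only considers neighbours inside the subsample, so it should be read as an estimate rather than the full-data value.

```python
import numpy as np

def subsample_pair(source, embedding, n_samples, seed=0):
    """Draw the same random rows from the raw data and its embedding."""
    if source.shape[0] != embedding.shape[0]:
        raise ValueError("source and embedding must have the same number of rows")
    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, source.shape[0])
    # Sample row indices without replacement and apply them to both arrays.
    idx = rng.choice(source.shape[0], size=n_samples, replace=False)
    return source[idx], embedding[idx]

# Stand-in data (assumption); the arrays in the question are
# (250000, 256) and (250000, 8).
rng = np.random.default_rng(42)
source = rng.normal(size=(2000, 16))
embedding = source[:, :2].copy()

src_sub, emb_sub = subsample_pair(source, embedding, n_samples=500)
print(src_sub.shape, emb_sub.shape)

# The pair can then be passed to the call from the question, e.g.
# validation.trustworthiness_vector(source=src_sub, embedding=emb_sub, max_k=30)
```

Repeating this with a few different seeds gives a rough idea of how much the estimate varies between samples.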

idekany commented 6 months ago

I had similar problems. After looking into the code of validation.py and running it without numba, I found an indexing error in the for loop starting at row 50. The array indices_embedded has one fewer column than max_k, but the code tries to access its column max_k-1 (which does not exist). This is because the first column of indices_embedded was removed in row 78, so that a point does not get compared to itself.

Changing for j in range(max_k) to for j in range(max_k-1) in lines 50 and 17 fixes the problem.
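The off-by-one described above can be reproduced in plain NumPy. This is a toy sketch, not the code from validation.py: the neighbour table and the count_reads helper are made up for illustration. In pure Python the overrun raises IndexError. As far as I understand, numba does not bounds-check array accesses by default, so the same read in compiled code can touch invalid memory and crash the process instead of raising.

```python
import numpy as np

max_k = 5
n_points = 4

# Hypothetical neighbour table: column 0 stands for each point itself.
indices = np.tile(np.arange(max_k), (n_points, 1))
# Dropping the self column leaves max_k - 1 columns.
indices_embedded = indices[:, 1:]

def count_reads(neighbours, n_cols):
    """Read the first n_cols columns of every row and count the reads."""
    reads = 0
    for i in range(neighbours.shape[0]):
        for j in range(n_cols):
            _ = neighbours[i, j]  # out of bounds when j == neighbours.shape[1]
            reads += 1
    return reads

# Looping to max_k overruns the array by one column.
try:
    count_reads(indices_embedded, max_k)
    overran = False
except IndexError:
    overran = True

# Looping to max_k - 1 stays inside the array.
safe_reads = count_reads(indices_embedded, max_k - 1)
print(indices_embedded.shape, overran, safe_reads)
```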