Minhvt34 opened 1 year ago
Yeah, I think so. After running some stress tests with random data, I found that the error occurred once the data exceeded shape (10000x12) on a standard computer. You may want to run this validation on a sample of your data (as in cross-validation) or increase your computational capacity.
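The sampling workaround above can be sketched as follows. This is a minimal sketch, not code from umap-learn: the arrays are random stand-ins for your raw data and embedding, and the commented-out `validation.trustworthiness_vector` call assumes the signature quoted elsewhere in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 12))   # stand-in for the raw high-dimensional data
emb = rng.normal(size=(10000, 2))  # stand-in for the UMAP embedding

# Validate on a random subset of rows instead of the full dataset,
# keeping the memory footprint of the neighbour search manageable.
n_sample = 2000
idx = rng.choice(X.shape[0], size=n_sample, replace=False)
X_sample, emb_sample = X[idx], emb[idx]

# With umap-learn installed, the sampled validation would then be:
# from umap import validation
# trust = validation.trustworthiness_vector(
#     source=X_sample, embedding=emb_sample, max_k=30)
```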
I had similar problems. After looking into the code of validation.py and running it without numba, I found that there is an indexing error in the for loop starting at line 50. The array `indices_embedded` has one column fewer than `max_k`, but the code tries to access its column `max_k - 1` (which does not exist). This is because the first column of `indices_embedded` was removed at line 78, so that a point does not get compared to itself. Changing `for j in range(max_k)` to `for j in range(max_k - 1)` on lines 50 and 17 fixes the problem.
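The off-by-one can be reproduced in isolation. This is a toy sketch, not the actual validation.py code: `indices_embedded` here is a hypothetical stand-in for a nearest-neighbour index array whose self-neighbour (first) column has been dropped, leaving `max_k - 1` columns.

```python
import numpy as np

rng = np.random.default_rng(0)
max_k = 5
n_points = 8

# Build a fake neighbour-index array, then drop the first column
# (each point's own index), mirroring what validation.py does.
indices_embedded = np.argsort(rng.random((n_points, n_points)), axis=1)[:, 1:max_k]

# Buggy bound: j reaches max_k - 1, one past the last valid column.
try:
    for j in range(max_k):
        _ = indices_embedded[:, j]
except IndexError:
    pass  # column max_k - 1 does not exist

# Fixed bound: every j stays in range.
for j in range(max_k - 1):
    _ = indices_embedded[:, j]
```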
I have this problem when using UMAP to reduce my raw data from shape (250000x256) to (250000x8). I also tried reducing to 2 dimensions, but I still got the error when calling `trustworthiness = validation.trustworthiness_vector(source=df_raw_data.to_numpy(), embedding=df_embedding.to_numpy(), max_k=30)`. I am wondering: is there any limit on the dataset size when applying UMAP validation?