jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0

umap_transform causes R Studio to abort (R encountered a fatal error.) #102

Closed ChVav closed 2 years ago

ChVav commented 2 years ago

Hi!

My R Studio session crashes when I try to use umap_transform. No further error message is given. I tested uwot 0.1.11 and 0.1.14, and exactly the same thing happens with both.

Many thanks!

Example code:

library(uwot)
train <- iris[1:100,]
test <- iris[101:150,]

set.seed(42)
train_umap <- umap(train, n_components = 50, ret_model=TRUE, y=train$Petal.Length)
set.seed(42)
test_umap <- umap_transform(test,train_umap)
library(uwot)
sessionInfo()

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] uwot_0.1.14 Matrix_1.4-1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 umap_0.2.9.0 RSpectra_0.16-1 compiler_4.2.0 pillar_1.7.0 tools_4.2.0 digest_0.6.29
[8] jsonlite_1.8.0 evaluate_0.15 lifecycle_1.0.1 tibble_3.1.7 lattice_0.20-45 pkgconfig_2.0.3 png_0.1-7
[15] rlang_1.0.4 DBI_1.1.2 cli_3.3.0 rstudioapi_0.13 yaml_2.3.5 xfun_0.30 fastmap_1.1.0
[22] dplyr_1.0.9 knitr_1.39 generics_0.1.2 vctrs_0.4.1 askpass_1.1 tidyselect_1.1.2 grid_4.2.0
[29] reticulate_1.26 glue_1.6.2 R6_2.5.1 fansi_1.0.3 rmarkdown_2.14 purrr_0.3.4 magrittr_2.0.3
[36] htmltools_0.5.2 ellipsis_0.3.2 assertthat_0.2.1 utf8_1.2.2 openssl_2.0.0 crayon_1.5.1

jlmelville commented 2 years ago

Hello, thanks for the report and the reproducible example. Tracking down what's happened is going to take me longer than the time I have for today, but I see that this example has also revealed some other problems that would probably cause issues even if the underlying memory error was fixed.

In the code you provide, the embedded coordinates from the initial run of umap contain NaN. umap_transform ought to check for this and give an error (bug number 1). umap should also check that the initial data doesn't contain NA, especially if it is responsible for generating the data (bug number 2).
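Until the package performs these checks itself, both conditions can be tested by hand before calling umap_transform. The following is a minimal sketch in base R, not the package's own validation: check_finite is a hypothetical helper, and the small matrix only stands in for an embedding whose later columns came back as NaN.

```r
# Hypothetical helper (not part of uwot): stop if a numeric matrix, or the
# numeric columns of a data frame, contain NA, NaN or infinite values.
check_finite <- function(x, what = "data") {
  if (is.data.frame(x)) {
    x <- as.matrix(x[, vapply(x, is.numeric, logical(1)), drop = FALSE])
  }
  n_bad <- sum(!is.finite(x))
  if (n_bad > 0) {
    stop(what, " contains ", n_bad, " non-finite value(s)", call. = FALSE)
  }
  invisible(TRUE)
}

# Stand-in for an embedding with a NaN column (4 rows, 3 columns).
embedding <- cbind(matrix(rnorm(8), nrow = 4), NaN)

# The iris training rows have no missing values, so this passes silently.
check_finite(iris[1:100, ], "training data")

# The stand-in embedding fails; capture the error message instead of stopping.
result <- tryCatch(check_finite(embedding, "embedding"),
                   error = function(e) conditionMessage(e))
print(result)
```

With a model returned by umap(..., ret_model = TRUE), the same check could be applied to train_umap$embedding before the model is passed to umap_transform.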

Those NaNs occur because you have set n_components=50 but the initial dimensionality of the iris dataset is only 4. I don't recommend trying to generate an embedding where n_components is greater than the dimensionality of the dataset. Again, the umap function should check for this and prevent it from occurring (bug number 3 and we haven't even got to the real problem yet).

I should check at this point @ChVav: did you mean to use n_components = 50 in this example or did you mean n_neighbors = 50? The latter makes more sense for iris, but I understand that you may have been using a different dataset that can't be shared for reproducibility purposes. If you did mean to use n_components = 50 with a dataset whose dimensionality is as low as that of iris, then please be aware that even after I fix the bug that is causing the crash, this is unlikely to ever work: both the spectral and the PCA-based initializations will give NA after the first 4 components, so you will need to set init="rand" in umap or pass a user-defined initialization. And that's only if I can be persuaded not to treat setting n_components higher than the number of columns in an input data frame or matrix as an error.
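The advice above can be turned into a small pre-check that picks an initialization from the input's dimensionality. This is only a sketch: choose_init is a hypothetical helper, not a uwot function, and it assumes that "spectral" and "random" are the init values you would otherwise pass to umap.

```r
# Hypothetical helper (not part of uwot): choose an `init` value for umap()
# from the number of numeric input columns and the requested n_components.
choose_init <- function(X, n_components) {
  n_features <- if (is.data.frame(X)) {
    sum(vapply(X, is.numeric, logical(1)))
  } else {
    ncol(X)
  }
  if (n_components > n_features) {
    warning("n_components (", n_components, ") exceeds the ", n_features,
            " numeric input columns; using random initialization",
            call. = FALSE)
    return("random")
  }
  "spectral"
}

# iris has 4 numeric columns, so 2 components fit and 50 do not.
init_low  <- choose_init(iris[1:100, ], n_components = 2)
init_high <- suppressWarnings(choose_init(iris[1:100, ], n_components = 50))
print(c(init_low, init_high))

# Possible use, left commented out so the sketch runs without uwot installed:
# train_umap <- uwot::umap(train, n_components = 50, init = init_high,
#                          ret_model = TRUE, y = train$Petal.Length)
```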

ChVav commented 2 years ago

Hi, thanks for answering so fast. This is very helpful.

Silly me, yes, my actual training/test set has >800,000 variables, so in the end I mean to grid search over n_components to decide how many dimensions to reduce to. For the iris dataset n_components = 50 of course does not make sense, my bad. Unsupervised clustering with the umap package at least worked fine for my full dataset, testing up to 250 components. I check for remaining NAs after imputation, so this is not the issue with my actual data.

I am testing my whole scheme with supervised dimension reduction on a subset of 5000 columns. Following your suggestion of an underlying memory issue, I found that the code works for n_components = {2, 3, 10} but not for n_components = 25. So thanks, at least I now understand why this crash was happening and can try to compute all this on a server.

Thank you!

jlmelville commented 2 years ago

@ChVav the problem should now be fixed on the master branch of this repo. The crash was triggered whenever n_components > n_neighbors. This is a serious enough bug to merit a new release on CRAN, but unfortunately I won't have much time to do that for a while. I am also not sure of a workaround. My apologies.

I would like to keep this issue open until I also fix the bugs around checking for NA in initial data and warning when n_components is probably set too high.
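For a grid search on a release that still has the bug, the reported trigger condition (n_components > n_neighbors) at least tells you which settings are affected. The sketch below is derived only from that condition and has not been confirmed as a workaround: split_grid is a hypothetical helper, and 15 is assumed as the n_neighbors value because it is umap's default.

```r
# Hypothetical helper (not part of uwot): split a grid of n_components values
# into those at or below n_neighbors and those above it.
split_grid <- function(grid, n_neighbors = 15) {
  list(safe  = grid[grid <= n_neighbors],
       risky = grid[grid > n_neighbors])
}

grid  <- c(2, 3, 10, 25)
parts <- split_grid(grid)

print(parts$safe)   # values that do not meet the reported trigger condition
print(parts$risky)  # values that do meet it with this n_neighbors
```

This split matches the behaviour reported earlier in the thread, where 2, 3 and 10 components worked and 25 did not.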

ChVav commented 2 years ago

Yes, I just tested your version on the master branch, and it also works for n_components > n_neighbors.

Many thanks for the help! :)