umap_transform can give odd results with dens_scale

FemkeSmit commented 2 years ago

I have a dataset consisting of ~1500 female samples and ~1600 male samples, both with the exact same variables and similar distributions between the two sexes. I've stratified this dataset based on sex and made a separate UMAP for them both. I then attempted to project the male dataset into the female UMAP space and vice versa using umap_transform, which worked fine for the female samples, and worked fine for most of the male samples, except that about 100 male samples got projected onto a ring surrounding the other datapoints, far away from them. I then reduced my male dataset to be the same size as the female dataset by removing the last 100 samples (the order of the samples is completely random) and this ring disappeared.

newCoor_female <- umap_transform(strat_dat$Female[,-1], umap_res$Male)
newCoor_male <- umap_transform(strat_dat$Male[1:nrow(strat_dat$Female),-1], umap_res$Female)

umap_res_gb <- umap_res
umap_res_gb$Female$embedding <- newCoor_female
umap_res_gb$Male$embedding <- newCoor_male

umap_res_gb %>%
  map(pluck, "embedding") %>%
  map(data.frame) %>%
  bind_rows(.id = "sex") %>%
  ggplot(aes(X1, X2)) +
  geom_point(alpha = .5, size = 1) +
  facet_wrap(~sex, nrow = 1, scales = "free") +
  theme_bw() +
  labs(x = "UMAP1", y = "UMAP2")

jlmelville commented 2 years ago

That's definitely weird. Does the same thing happen if you remove the first 100 items rather than the last 100 items?

Unfortunately I am unlikely to be able to investigate this for at least a couple of weeks, but I will try to take a look when I can.

FemkeSmit commented 2 years ago

Yes, it doesn't matter which samples I remove, as long as the projected dataset ends up the same size as, or smaller than, the dataset whose space it's being projected into.

jlmelville commented 2 years ago

@FemkeSmit sorry for the delay in getting back to this. I am having trouble reproducing the problem with the datasets I have. Can you tell me what version of uwot you are running? Also, if you are able to install packages from github, would you be able to run the code below and let me know if you see the ring structure?

devtools::install_github("jlmelville/snedata")
devtools::install_github("jlmelville/vizier")

mnist <- snedata::download_mnist()
mnist_train <- head(mnist, 60000)
mnist_test <- tail(mnist, 10000)

mnist_umap_test <- umap(mnist_test, ret_model = TRUE)
mnist_umap_train_transform <- umap_transform(mnist_train, mnist_umap_test)

vizier::embed_plot(mnist_umap_test$embedding, mnist_test, cex = 0.1, alpha_scale = 0.1, title = "10 000 model points")
vizier::embed_plot(mnist_umap_train_transform, mnist_train, cex = 0.1, alpha_scale = 0.1, title = "60 000 transformed points")

These are the images I get, where the 60,000 MNIST training images in the second image are transformed using a model built with the smaller (10,000 image) test set. So it seems like there must be something else going on other than when the original dataset is smaller than the dataset passed to umap_transform.

[Images: "10 000 model points" and "60 000 transformed points"]

FemkeSmit commented 2 years ago

@jlmelville I'm running version 0.1.14 of uwot. A few coworkers of mine actually also ran into this issue - the ring being formed - with their dataset, and also managed to solve it by reducing the size of the dataset that was being transformed, so it's not an issue unique to me. Still, when I run your code the ring doesn't form, so I don't know what might be different between these cases.

FemkeSmit commented 2 years ago

Another update: I just used your code for creating the UMAP object on my data, and now no ring formed.

umap_res2 <- strat_dat %>%
  map(~ {
    dat <- select(.x, -eid)
    umap(dat, ret_model = TRUE)
  })

With this, no ring formed (after transformation, like described before).

umap_res <- strat_dat %>%
  map(~ {
    dat <- select(.x, -eid)
    nn <- 10
    umap(
      dat,
      n_components = 2,
      n_neighbors = nn,
      nn_method = "annoy",
      n_trees = 100,
      n_sgd_threads = "auto",
      init = "pca",
      n_epochs = 500,
      approx_pow = TRUE,
      binary_edge_weights = TRUE,
      dens_scale = 1,
      ret_extra = c("model", "nn", "fgraph"),
      verbose = FALSE,
      ret_model = TRUE
    )
  })

With this, a ring did form.

Edit: I just tried using my original umap settings on the mnist dataset, but there no ring forms. I wouldn't know why.

jlmelville commented 2 years ago

Ok, so the ring seems to be due to one or a combination of parameters. If you are able to continue helping me, can you try your umap parameters, but turn off the following parameters one at a time (i.e. re-run 4 times, each time with one of these removed):

dens_scale = 1
binary_edge_weights = TRUE
approx_pow = TRUE
init = "pca"

I don't want to prejudge matters, but this is in decreasing order of suspicion (so I suspect it's dens_scale causing the issue). We can look at some of the other parameters if this doesn't have an effect.
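In case it helps, the one-at-a-time check could be scripted roughly as below. This is only a sketch: `X` is a hypothetical random stand-in for your data, `n_neighbors` and `n_epochs` are arbitrary choices to keep it quick, and the names `suspects`, `base_args` and `runs` are made up for illustration.

```r
library(uwot)

# Hypothetical stand-in data: 300 rows, 10 roughly standardised columns.
set.seed(42)
X <- matrix(rnorm(300 * 10), ncol = 10)

# The four suspect settings, in decreasing order of suspicion.
suspects <- list(
  dens_scale = 1,
  binary_edge_weights = TRUE,
  approx_pow = TRUE,
  init = "pca"
)

# Arguments shared by every run (values chosen only to keep the sketch fast).
base_args <- c(
  list(X = X, n_neighbors = 10, n_epochs = 200, ret_model = TRUE),
  suspects
)

# One model per suspect, each built with that single argument dropped
# so that uwot falls back to its default for it.
runs <- lapply(names(suspects), function(p) {
  args <- base_args[names(base_args) != p]
  do.call(umap, args)
})
names(runs) <- names(suspects)
```

Each element of `runs` is a model with one suspect setting removed, so passing each in turn to `umap_transform` with the data to be projected should show which setting the ring depends on.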

FemkeSmit commented 2 years ago

It was indeed dens_scale! This is the result if I remove that parameter:

jlmelville commented 2 years ago

The dens_scale parameter is very new so needs some experimentation. What could be happening is that the original data has a different density to the transformed data and when transforming the new data, the new density parameters are large extrapolations outside the useful range. If you want to diagnose this more:

For transforming new data where there is a chance of different densities, it's probably better to set dens_scale to a more conservative value (e.g. dens_scale = 0.5). This will allow the transformed data to work in a region with less chance of numerical issues.
Carry out UMAP on all the data and include localr in your ret_extra vector, i.e. ret_extra = c("model", "nn", "fgraph", "localr"), then also look at a plot where the points are colored by localr, which are the estimates of the density around each point used with dens_scale. If the points you were planning to transform have a very different color to the others that would suggest that using a smaller dens_scale for the initial umap would be a good idea.
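A rough sketch of that localr check, under stated assumptions: `X` and `group` are hypothetical stand-ins (two groups with deliberately different spread), `dens_scale = 0.5` is just the conservative value suggested above, and I am assuming the radii come back as the `localr` element of the result.

```r
library(uwot)

set.seed(1)
# Hypothetical data: the "new" group is much more spread out than the "model" group.
group <- rep(c("model", "new"), each = 200)
X <- rbind(
  matrix(rnorm(200 * 10, sd = 1), ncol = 10),
  matrix(rnorm(200 * 10, sd = 3), ncol = 10)
)

# Embed all the data together and ask for the local radius estimates.
res <- umap(X, n_neighbors = 10, dens_scale = 0.5,
            ret_extra = c("model", "localr"))

# Numerical comparison of localr between the two groups.
print(tapply(res$localr, group, summary))

# Colour the embedding by binned localr.
bins <- as.integer(cut(res$localr, breaks = 10))
pal <- hcl.colors(10)
plot(res$embedding, col = pal[bins], pch = 20, cex = 0.6,
     xlab = "UMAP1", ylab = "UMAP2")
```

If the points you plan to transform sit in a clearly different part of the localr range from the rest, that would be the hint to use a smaller `dens_scale` for the initial model.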

I need to update the documentation around this. Also, in umap_transform, you can't currently export the localr or the downstream parameters used when dens_scale is involved, which would help with diagnosing this sort of thing. So I will keep this issue open to add all that. The development version of umap_transform should have better safe-guarding around extreme values, but I may need to take another look and maybe add a warning if I can detect this (if this is actually what's causing the ring issue).

If using dens_scale is still of interest to you, I would be interested to know if a smaller value of dens_scale is able to help preserve some of the density information but not ruin the transformation.

FemkeSmit commented 2 years ago

Interestingly enough, when I delete the dens_scale setting from the function altogether the ring disappears, but if I set it to 0 it appears again. Here are the results coloring by localr for dens_scale = 1, dens_scale = 0 and dens_scale = NULL.

jlmelville commented 2 years ago

dens_scale = NULL and dens_scale = 0 should give the same results, and they do for all the datasets I looked at with the current development version of uwot. So that might be a problem with the current CRAN version of uwot.

Anyway, as for the ring structure, I am still struggling a bit to generate something that looks like what you get. I do see a ring structure when embedding two overlapping Gaussians, where one has a much larger standard deviation than the other (but they have the same center). When dens_scale = 1, if you run UMAP on the smaller cluster first, then attempt to transform the larger cluster, you do get the larger cluster forming a ring(ish) shape around the smaller cluster. With dens_scale = NULL there is much less shape (like the dens_scale = NULL structure you have).
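For reference, the two-Gaussian experiment can be set up along these lines. It is a sketch only: the sample size, dimensionality and standard deviations are arbitrary choices, and the names `small`, `large`, `r_small` and `r_large` are made up for illustration. Whether a ring actually shows up will depend on the uwot version and the seed.

```r
library(uwot)

set.seed(42)
n <- 500
d <- 10

# Two Gaussians with the same center but very different standard deviations.
small <- matrix(rnorm(n * d, sd = 1), ncol = d)
large <- matrix(rnorm(n * d, sd = 5), ncol = d)

# Build the model on the tight cluster only, then project the wide cluster into it.
model <- umap(small, n_neighbors = 10, dens_scale = 1, ret_model = TRUE)
large_xy <- umap_transform(large, model)

# Distance of every point from the centroid of the model embedding.
center <- colMeans(model$embedding)
r_small <- sqrt(rowSums(sweep(model$embedding, 2, center)^2))
r_large <- sqrt(rowSums(sweep(large_xy, 2, center)^2))

print(summary(r_small))
print(summary(r_large))
```

A ring would show up as the transformed points having radii concentrated well outside the range of the model points; rerunning with `dens_scale = NULL` gives the comparison case.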

So, could it be that the dataset you are using, when stratified by sex, results in data where the male subset has features with a substantially smaller variance (on average) than the female subset? I would expect that to be reflected in the localr results, and it does appear that possibly the outer ring contains very few points with a large localr? It would be easy enough to check this by running summary on the localr vectors for the umap output for the different subsets, but also looking at the variance of the input features of the data may do the same thing.

I would like to understand a bit more about the data you are using: can you say how many columns the data has? ~~Could you also provide the result of calling summary on the combined data and then the male and female subsets?~~

Edit: actually unless there are very few columns that's a terrible idea. I meant can you call summary on the result of calculating the standard deviation on each column of your data, e.g.

summary(apply(the_numeric_columns_of_your_data, 2, sd))

Also: are there duplicates in the dataset? I understand if providing this is not possible, but I would very much like to understand what's happening a bit more.
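To be concrete about what I am asking for, something like the following would do. The data frame `dat` here is a hypothetical stand-in with a `sex` column and 10 numeric columns; substitute your own data and column names.

```r
set.seed(1)
# Hypothetical stand-in: one grouping column plus 10 numeric columns (X1..X10).
dat <- data.frame(
  sex = rep(c("Female", "Male"), times = c(150, 160)),
  matrix(rnorm(310 * 10), ncol = 10)
)
num_cols <- vapply(dat, is.numeric, logical(1))

# Spread of the per-column standard deviations over the combined data.
print(summary(apply(dat[, num_cols], 2, sd)))

# The same summary, computed separately for each sex.
sd_by_sex <- lapply(split(dat[, num_cols], dat$sex),
                    function(d) apply(d, 2, sd))
print(lapply(sd_by_sex, summary))

# Number of rows that exactly duplicate an earlier row.
n_dup <- sum(duplicated(dat[, num_cols]))
print(n_dup)
```

If the per-column standard deviations differ noticeably between the subsets, that would point to the density mismatch described above.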

At any rate, now that I know dens_scale is involved, my recommendation would be that if you are intending to transform new data and the new data is unlikely to be from the same distribution as the original data, then dens_scale should not be used to build the UMAP model: we can't get a good estimate of the local density if none of the other data that would be local to that point is in the original dataset. Purposely stratifying into the male/female split seems like it could be causing a problem like that somehow. I should add a warning to umap_transform at the very least and it may be necessary to prevent transforming new data if dens_scale is used -- densMAP also does not allow for transforming new data perhaps for similar reasons.

jlmelville commented 2 years ago

I edited the title of the issue to reflect my current understanding of what is going on

FemkeSmit commented 2 years ago

My data consists of 10 numeric variables with about 1500 female samples and 1600 male samples. All variables have been normalized to have mean 0 and sd 1. Most variables follow a normal distribution, though some have a long tail, and one is vaguely binomial. All distributions are very similar between the male and female samples.

I tried running the UMAP again with dens_scale = 1, but this time removing a bunch of samples from the dataset so there were now 1500 female samples and 1400 male samples, and then transforming into each other's space again. This time a ring formed around the female UMAP projection instead of the male UMAP projection.

Removing either the last 100 samples:

newCoor_female <- umap_transform(strat_dat2$Female[0:1400,-1], umap_res2$Male)
umap_res_gb$Female$embedding <- newCoor_female

or the first 100 samples:

newCoor_female <- umap_transform(strat_dat2$Female[101:1500,-1], umap_res2$Male)
umap_res_gb$Female$embedding <- newCoor_female

from the female dataset when transforming the female coordinates removed the ring.

jlmelville commented 2 years ago

Thank you for the information. Seems that there isn't anything in your data that should cause the problem, so I went back to the MNIST example and I can now reproduce the ring structure, now I know dens_scale is involved:

Not sure exactly what is going on: depending on the seed, the session tends to die more often than not which unfortunately means debugging the C++. But this is sort-of reproducible so I will try to fix it.

jlmelville commented 2 years ago

@FemkeSmit I found the error: arrays for the original and new data were swapped. The current development version of uwot has a fix. There will be some other pushes to document and test the fix but what is currently there should now work correctly.

I'm sorry I failed to test this code path appropriately, and I appreciate the extensive help in hunting this down.

FemkeSmit commented 2 years ago

No problem, I'm happy to help. Glad you managed to find and solve the issue!

jlmelville commented 2 years ago

Hopefully there isn't much more to say on this, but I have also fixed two other issues that have arisen from this discussion:

approx_pow parameter is ignored when dens_scale is non-NULL. There will now be a warning to that effect if approx_pow = TRUE and dens_scale is set.
umap was not exporting whether binary_edge_weights was set and umap_transform wasn't checking for it and hence not setting the new edges to 1/0 when transforming new data. This is now fixed (but shouldn't have a massive effect if other settings are kept to their typical values).

jlmelville commented 8 months ago

This seems fixed

jlmelville / uwot

umap_transform can give odd results with dens_scale #103