jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0

Does uwot need an internet connection? #94

Closed summerghw closed 8 months ago

summerghw commented 2 years ago

Hey, I wonder whether the R package uwot needs an internet connection to function. I ran the same data in the same Docker container on two computers. On the one that was not connected to the internet, an error occurred:

    08:54:07 Writing NN index file to temp file /tmp/RtmpO90Kgu/file556f6f9565
    08:54:07 Searching Annoy index using 1 thread, search_k = 3000
    08:54:11 Annoy recall = 0.2088%
    08:54:12 Commencing smooth kNN distance calibration using 1 thread
    08:54:12 14365 smooth knn distance failures
    Error in x2set(Xsub, n_neighbors, metric, nn_method = nn_sub, n_trees, : Non-finite entries in the input matrix

The program runs without problems on the computer that is connected to the internet. The log is below:

    08:53:01 Writing NN index file to temp file /tmp/Rtmpk0B9KK/file139c1144b046
    08:53:01 Searching Annoy index using 1 thread, search_k = 3000
    08:53:06 Annoy recall = 100%
    08:53:07 Commencing smooth kNN distance calibration using 1 thread
    08:53:09 Initializing from normalized Laplacian + noise
    08:53:10 Commencing optimization for 200 epochs, with 583328 positive edges
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    08:53:18 Optimization finished

SamGG commented 2 years ago

The Annoy recall is definitely not the same. Did you set the seed to make the runs reproducible?

summerghw commented 2 years ago

Yes, I set the seed: my.seed <- 202106L

jlmelville commented 2 years ago

There is no connection to the internet required. No network communication of any kind should be happening.

I assume the various GitHub Actions for testing R packages make use of containers, so there shouldn't be a problem with using uwot with Docker. All I can think of with the information provided is:

  1. The nearest neighbor search needs to be able to read and write the Annoy index to temporary disk space (as the message Writing NN index file to temp file /tmp/ indicates). Are you sure that both hosts are set up to provide this storage in the same way (e.g. same permissions, same amount of space)? There could be failures here where I have failed to detect these states and not provided an appropriate error message. In my own experience with getting containers to read and write data to host storage (albeit unrelated to uwot or R), I had to be quite careful with user permissions and matching user and group ids between host and container. But that was a few years ago.
  2. If you look at the Annoy issues, making it work with Docker seems to be a rich source of problems. Usually these seem to come down to compilation flags that only work on the machine where Annoy was compiled, and the restrictions that CRAN puts on such flags should make us immune to that. But do you know if the machine on which uwot/the container was built has the same architecture as the two machines you are running the container on? Even very small differences seem to cause problems.
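Both points above can be checked from inside the container on each host. The following is a minimal sketch in base R, not part of uwot: `check_tempdir` is a hypothetical helper name, and the 1 MB payload size is an arbitrary assumption. It writes a binary file to the session's temporary directory, reads it back, and reports the platform string R was built for. Note that `R.version$platform` describes the R build, not the CPU features of the host, so matching values do not rule out an architecture mismatch.

```r
# Hypothetical helper (not part of uwot): verify that the session's temporary
# directory can hold and return a binary file, and report the R build platform.
check_tempdir <- function(n_bytes = 1e6) {
  dir <- tempdir()
  path <- tempfile(pattern = "nncheck", tmpdir = dir)
  on.exit(unlink(path), add = TRUE)

  # Deterministic payload so the check does not touch the RNG state.
  payload <- rep_len(as.raw(0:255), n_bytes)
  writeBin(payload, path)
  readback <- readBin(path, what = "raw", n = n_bytes)

  list(
    dir          = dir,
    writable     = unname(file.access(dir, mode = 2) == 0),
    roundtrip_ok = identical(payload, readback),
    platform     = R.version$platform
  )
}

res <- check_tempdir()
str(res)
```

Running this in the container on both machines and comparing the output side by side would show whether the temporary storage behaves the same way on each host.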

As an aside, the recall value you get in the first log (0.2088%, from the run that fails) is worrying: it means that almost all of the observations in your dataset fail to find themselves as their own nearest neighbor. Either the nearest neighbor search is failing (this could be due to not enough trees or too low a search_k value) or you have a lot of duplicates. If you aren't expecting duplicates in your data, it's worth investigating that before proceeding.
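A quick way to investigate the input itself is to count duplicate rows and non-finite entries before calling uwot. This is a minimal sketch in base R under stated assumptions: `check_input` is a hypothetical helper, and the small matrix `X` is made-up illustration data, not the data from this issue.

```r
# Hypothetical helper (not part of uwot): summarise potential problems in a
# numeric input matrix before running dimensionality reduction on it.
check_input <- function(X) {
  X <- as.matrix(X)
  list(
    n_rows           = nrow(X),
    # NA, NaN, Inf and -Inf all count as non-finite.
    n_non_finite     = sum(!is.finite(X)),
    # duplicated() on a matrix flags rows that repeat an earlier row.
    n_duplicate_rows = sum(duplicated(X))
  )
}

# Made-up example: row 2 repeats row 1, and row 3 contains an NA.
X <- rbind(c(1, 2), c(1, 2), c(3, NA), c(4, 5))
report <- check_input(X)
str(report)
```

If the counts are zero on both machines, the data is unlikely to be the cause, and increasing `n_trees` or `search_k` in the `umap` call is the other remedy mentioned above.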

summerghw commented 2 years ago

@jlmelville thank you for your reply.