ropensci / unconf18

http://unconf18.ropensci.org/

Implementation of non-linear dimensionality reduction algorithm (UMAP) #43

Open seaaan opened 6 years ago

seaaan commented 6 years ago

I recently read about a new non-linear dimensionality reduction algorithm called UMAP (github, arxiv), which is much faster than t-SNE, while producing two-dimensional visualizations that share many characteristics with t-SNE. I initially found out about it in the context of use on high-dimensional single-cell data in this paper.

The reference implementation is in Python (see github link above). It can be run in R through rPython as shown here. There is an R package designed for comparing dimensionality reduction techniques that contains an implementation of UMAP, but this package is "not suitable for large scale visualization" and I'm not completely sure based on the README whether it is an accurate or approximate implementation.

My thought is that the ideal would be a package focused on UMAP specifically, implemented in R or Rcpp. Unfortunately I am not at all an expert in this topic or familiar with the mathematics involved, so the best I would be able to do is try to translate the Python implementation into R.

malisas commented 6 years ago

Hi @seaaan , I use t-SNE at work all the time to analyze flow data and would potentially be interested in something like UMAP (especially when I have more than 100,000 data points and the t-SNE runtime starts to slow me down). I have zero experience in C++ or the mathematics involved but would like to learn both topics, and at any rate would like to express my interest in your proposal.

I'm quite ignorant about this entire topic, but what is the benefit of re-implementing the algorithm in R/Rcpp as opposed to relying on the bindings? (Though even if the user experience is the same, I can see how this could still be a really cool educational project.)

seaaan commented 6 years ago

I am also not an expert in this topic, but from what I can tell, the advantage of an R or Rcpp version as compared to Python would mainly be convenience. Running UMAP through rPython requires you to install Python and UMAP first and then install rPython (which works from CRAN for Linux and Mac but requires some additional effort for Windows). So you can get it to work but it's not as seamless as just installing a single package from R.

I could be wrong about some of the steps as I haven't had a chance to actually test the rPython version yet, so maybe it's not as complicated as it sounds.

noamross commented 6 years ago

If the Python version is performant, it might be worthwhile to wrap it with the reticulate package rather than port it. With reticulate you can share in-memory objects with R, so data transfer is more efficient, and I think it is cross-platform.

seaaan commented 6 years ago

Thanks! So here are the options as I see them:

seaaan commented 6 years ago

Summary: UMAP is a new non-linear dimensionality reduction algorithm that's like t-SNE but faster. It can be used for all kinds of data but I'm interested in it for flow cytometry and single cell RNA sequencing. We could either wrap the Python implementation or implement it ourselves in R/Rcpp.

PeteHaitch commented 6 years ago

There are definitely people in the Bioconductor community interested in this (although I don't know if any are going to be at the unconf). We discussed a reticulate wrapper around the Python implementation when a few of us were together at the collaborative computational tools for the human cell atlas. @drisso may remember who showed the most interest in doing this or of any efforts underway.

stefaniebutland commented 6 years ago

"There are definitely people in the Bioconductor community interested in this (although I don't know if any are going to be at the unconf)"

I think @lcolladotor said he attends the annual Bioconductor meetings. Lori Shepherd (@lshep) from the Bioconductor Core team participated in unconf17.

drisso commented 6 years ago

Not surprisingly @LTLA was part of the conversation. It was in the more general context of how to create an interface between Bioconductor and scanpy. I won't be at the unconf, but I'm happy to help remotely, if needed!

I think a reticulate wrapper could be the easier solution, unless a C++ implementation is much faster than the original Python.

LTLA commented 6 years ago

I won't go into it too much to avoid derailing this thread, but my idea would be to make it as easy (and standard, and reliable) to call Python code in an R package as it is to call C/C++/Fortran code. The difficulty lies in controlling the versions of Python and its packages (which we currently cannot do), in order to guarantee consistent behaviour and ensure easy installability across systems.

A reticulate wrapper around UMAP would indeed be easier in terms of the amount of initial work. However, without a standard framework for controlling Python packages and versions from within R, it shifts the burden onto the end-user (and ultimately back to developers in terms of support requests).
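To make the version-control concern concrete, here is a minimal sketch of the kind of check a wrapper would otherwise leave to the end-user. It is an illustration only: the function name `check_environment`, its parameters, and any version pins passed to it are assumptions made for this example, not part of reticulate or of the UMAP reference implementation.

```python
import sys
from importlib import metadata


def check_environment(min_python=(3, 6), required=None):
    """Return a list of human-readable problems; an empty list means OK.

    `required` maps a distribution name to an exact version string, or to
    None when any installed version is acceptable. The names and pins are
    whatever the caller assumes; nothing here is a recommendation.
    """
    required = required or {}
    problems = []

    # Check the interpreter itself (major, minor only).
    if sys.version_info[:2] < tuple(min_python):
        problems.append("Python %d.%d or newer is required" % tuple(min_python))

    # Check each required distribution and, if pinned, its exact version.
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append("%s is not installed" % name)
            continue
        if pinned is not None and installed != pinned:
            problems.append(
                "%s %s is installed, but %s is expected" % (name, installed, pinned)
            )
    return problems
```

For example, `check_environment(required={"umap-learn": None})` would report whether the reference implementation is installed at all. A check like this only detects a mismatch; it does not install or fix anything, which is the part that falls back on the user.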

lcolladotor commented 6 years ago

Hi. I don't really use Python and was not part of the work Peter, David and Aaron Lun did. So I can't really comment on this thread.

Best, Leo

seaaan commented 6 years ago

Thanks all for chiming in! I think it would be great to be able to call Python code in R as easily as you can call C++ code. That's way beyond the area of my expertise, unfortunately.

My feeling is that if we do this project, we should reimplement the algorithm in R. Since I work in a Windows environment where we have to call IT to do anything requiring administrative privileges, installing new programs and updating PATH variables is a big pain at work. For that reason it's really nice when an R package just works without any external dependencies. So that's my view. Of course it will be up to everyone who wants to participate to decide together, so we'll see what happens on Monday.

Thanks again all.

Sean

seaaan commented 6 years ago

Possible tasks for this project.

Logistics:

Documentation:

Testing:

Package functionality:

LTLA commented 6 years ago

If you're going to translate it anyway, you'll probably want to write it in C++. Not only will this interface with R via Rcpp, it may also be useful more generally, as a good implementation will plug in directly to all languages that support C++ bindings... or for anyone who wants to make an executable.

It seems that the trick is to break up the Python implementation into chunks that can be easily written and tested in isolation. I've had a look at their code and it seems pretty tidy. Not entirely straightforward, but not the most complicated either. After that, it's just a case of bashing out a word-for-word C++ implementation.

I won't be going to unconf but I might be able to spare some weekends to help with the last (and easiest) bit, provided we have a list of test inputs and outputs for each section (and the pseudo-code to do it).
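One possible way to produce such a list of test inputs and outputs is sketched below. It is not taken from the UMAP code base: `pairwise_euclidean` is a stand-in for whatever chunk is being ported, and `record_fixture` and `check_against_fixture` are hypothetical helper names. The idea is only to record what the reference chunk returns for a given input, then compare a port against that record.

```python
import json
import math


def pairwise_euclidean(points):
    """Stand-in 'chunk': dense matrix of Euclidean distances between rows."""
    return [[math.dist(p, q) for q in points] for p in points]


def record_fixture(func, inputs):
    """Run the reference chunk and serialise its input and output as JSON."""
    return json.dumps(
        {"chunk": func.__name__, "input": inputs, "output": func(inputs)}
    )


def check_against_fixture(candidate, fixture_json, tol=1e-9):
    """Return True if `candidate` reproduces the recorded output within `tol`."""
    fixture = json.loads(fixture_json)
    expected = fixture["output"]
    actual = candidate(fixture["input"])
    # Compare shape first, then every entry up to the tolerance.
    return len(actual) == len(expected) and all(
        len(row_a) == len(row_e)
        and all(abs(a - e) <= tol for a, e in zip(row_a, row_e))
        for row_a, row_e in zip(actual, expected)
    )


fixture = record_fixture(pairwise_euclidean, [[0.0, 0.0], [3.0, 4.0]])
print(check_against_fixture(pairwise_euclidean, fixture))
```

The JSON text is language-neutral, so a C++ or R port could read the same fixture. Chunks that draw random numbers would also need a fixed seed, or a looser comparison, for this to be meaningful.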

juyeongkim commented 6 years ago

See umapr