qiime2 / q2-diversity

BSD 3-Clause "New" or "Revised" License
4 stars 45 forks source link

Added t-SNE and UMAP Function #300

Closed TimJLU closed 3 years ago

TimJLU commented 4 years ago

I added t-SNE Function to the _ordination.py File.

It is just a first draft to start a Discussion and still in progress.

Next step would be: Better way to pass Arguments. Create a default setting if you don't pass any Arguments.

sjanssen2 commented 4 years ago

This PR is a result of our discussion in the forum: https://forum.qiime2.org/t/t-sne-or-umap-as-q2-plugins/14477

sjanssen2 commented 4 years ago

I wonder why Travis isn't trigged by this PR - are we missing something here? Or do we have to be more patient?

thermokarst commented 4 years ago

I wonder why Travis isn't trigged by this PR - are we missing something here?

I have no clue why it didn't trigger... Maybe GitHub had a hiccup? EIther way, I don't expect the tests to tell us to much, since the function added here hasn't been wired into the plugin, and there are no unit tests yet.

This is pretty exciting, let us know what you need from us!

TimJLU commented 3 years ago

The recent check has failed, but I don´t get the Error Message. Is it because of the tsne Function and the way I used **kwargs?

sjanssen2 commented 3 years ago

We should think about which parameters of the algorithms should be exposed to the user via the plugin. I'd think exposing all parameters is overkill and will confuse the default user. Unfortunately, I am not very familiar with both techniques and cannot answer this question due to personal experience :-/

t-SNE

Reading about t-SNE, I think our plugin should expose perplexity and n_components. If the Qiime2 framework allows access to the --verbose parameter, we might also want to pass that to the verbose flag of the sklearn algorithm. And since we will always pass a precomputed distance matrix, n_jobs should be set to 1. Not sure if we want to expose it as --p-n-jobs to indicate that multiple CPUs/cores won't speed up the process?

UMAP

UMAP documentation has the great chapter https://umap-learn.readthedocs.io/en/latest/parameters.html# which tells me we should expose n_neighbors, min_dist and of course n_components

sjanssen2 commented 3 years ago

many thanks @thermokarst. @TimJLU is on a writing sprint to finish his masters thesis. It contains some benchmark and examples for applying PCoA, tSNE, umap to different datasets. Hope we can distil his results into a nice little summary to show that each method suits different purposes and therefore it is great to have them all in qiime2

TimJLU commented 3 years ago

I finally were able to implement all your requested changes @thermokarst.

Right now the number_of_dimensions parameters are required (but since they specify a default value of 2, you don't necessary have to provide that as input). Looking at the pcoa method in this plugin, it looks like there we have made the number_of_dimensions parameter completely optional. I am curious why you aren't doing that here - care to explain?

During my thesis I got the best visual results for t-SNE and UMAP by reducing the dimensions to 2. The idea was to use recommended default settings so that inexperienced Users could get the best results without changing any of the variables.

TimJLU commented 3 years ago

Thank you very much @ebolyen and @thermokarst. I hope I changed everything accordingly. The only question still up is how to deal with the numba package within in the umap-learn package? Did I miss something?

TimJLU commented 3 years ago

Hey, i saw that the numba package got an update on August 20. Maybe this would solve the problem on it's own?

gregcaporaso commented 3 years ago

@TimJLU, thanks so much for this pull request and apologies for how long it took us to get this in!

I just pulled this branch down and tested both new actions with the Moving Pictures tutorial data and a new tutorial data set that we're working with (Liao et al, 2021). The umap-learn team's conda install command worked for me, so it looks like that issue is sorted out, and the plots I generated all look reasonable. We're going to run the tests, and as long as everything seems to be working we'll get this merged for the 2021.11 release (scheduled for next week).