Failure to separate highly separable data

jdonaldson / rtsne

An R package for t-SNE (t-Distributed Stochastic Neighbor Embedding)


Failure to separate highly separable data #4

Open s-andrews opened 7 years ago

s-andrews commented 7 years ago

I've been testing the latest CRAN version of this package on some data which should be highly separable and have been getting very poor results (basically no separation), so it looks like there's a bug which affects the ability to separate at least some datasets.

I've written up my tests at http://www.bioinformatics.babraham.ac.uk/tsne/ and have provided the data I used there too so you can replicate this. Others have also reported similar findings (links at the end of my document) so I don't think it's just me.

jdonaldson commented 7 years ago

Thanks for the data, I'll look into it.

jdonaldson commented 7 years ago

It doesn't look like the tsne library is inferring the "type" of the transposed dataframe correctly for some reason. I'll pin that down.

One quick workaround is to pass in your transposed matrix with distances precalculated. This separates things as expected.

e.g.

tsne(dist(t(tsne.data)), perplexity = 5) -> tsne.result

[Image: rplot]
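The workaround above can be sketched on synthetic data using base R. This is a minimal illustration, not the data from the issue: `toy.data`, the group sizes, and the perplexity value are made up, and the layout (features in rows, samples in columns) is an assumption based on the transposition in the call above. The `tsne()` call runs only if the tsne package is installed.

```r
# Hypothetical stand-in for the issue's data: 20 features in rows,
# 10 samples in columns, two well-separated groups of 5 samples each.
set.seed(1)
group_a <- matrix(rnorm(20 * 5, mean = 0), nrow = 20)
group_b <- matrix(rnorm(20 * 5, mean = 10), nrow = 20)
toy.data <- cbind(group_a, group_b)
colnames(toy.data) <- c(paste0("a", 1:5), paste0("b", 1:5))

# dist() measures distances between rows, so transpose to put samples in rows.
sample.dist <- dist(t(toy.data))

# Sanity check before embedding: the largest within-group distance should be
# smaller than the smallest between-group distance if the data are separable.
d <- as.matrix(sample.dist)
max.within <- max(d[1:5, 1:5], d[6:10, 6:10])
min.between <- min(d[1:5, 6:10])

# Pass the precomputed "dist" object to tsne(), as in the workaround.
if (requireNamespace("tsne", quietly = TRUE)) {
  toy.result <- tsne::tsne(sample.dist, perplexity = 3, max_iter = 200)
}
```

Checking `max.within` against `min.between` first gives a quick way to confirm that the input distances really are separable, so that a poor embedding can be attributed to the t-SNE step and not to the data.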

jdonaldson commented 7 years ago

Also, regarding speed: this library was primarily intended as an educational resource, with features for reporting progress and for restarting convergence from pre-trained embeddings.

However, I'm currently working with another collaborator to implement the core logic in Rcpp and to add Barnes-Hut-style techniques. This should bring it to performance parity with the other libraries (which are typically wrappers around the same C++ runtime).

s-andrews commented 7 years ago

Thanks for the replies and the workaround for this data. For our immediate purposes I think we'll shift over to Rtsne (for the speed boost as much as anything), but it's good to know there's another viable alternative too.