lejon / TSne.jl

Julia port of L.J.P. van der Maaten and G.E. Hintons T-SNE visualisation technique.
Other
143 stars 25 forks source link

tnse() treats input matrix X rows as observations, should be columns #19

Open alyst opened 6 years ago

alyst commented 6 years ago

tsne(X) treats X rows as observations (points) and columns as their features (dimentions). I guess it's because the original implementation is written in Python/NumPy, which uses row-major order. However, Julia packages (Clustering.jl, Distances.jl, BlackBoxOptim.jl) tend to treat rows as features and columns as observations, because in Julia matrices are stored in column-major order (one column/observation occupies continuous block of memory).

IMO it would be nice to switch to the default Julian behaviour at some point, but that would be a big breaking change, of course.

oxinabox commented 6 years ago

Other notable package also treat columns as observations: MultivariateStats.jl, MLDataUtils.jl (though that is configurable, columns is the default), Distributions.jl

xukai92 commented 6 years ago

I agree. Row-major can actually cause performance issue: fetching a row is slow than fetching a column in Julia

juliohm commented 6 years ago

One more vote for column-major ordering.

TSne.jl could be absorbed by MultivariateStats.jl for better maintenance, peer review of PRs, and performance improvements.

alyst commented 6 years ago

@juliohm Technically speaking, you can submit PRs and get peer review right away. E.g. you are very welcome to submit col-major PR. Integration into MultivariateStats.jl (or the move under JuliaStats umbrella, as a more modest alternative) could make it easier for the users to discover tsne() and guarantee some minimal level of maintenance. But, based on my experience, I don't expect the overflow of PRs because of the code relocation.

oxinabox commented 6 years ago

@juliohm if someone were to implement this into MultivariateStats (which I think is a good idea), it will need to be clean-room implemented from the paper, or some other description of the algorithm as the license on this repo is not MIT compatible

juliohm commented 6 years ago

Yes, it is unfortunate that t-SNE is currently separate from the rest of the embedding methods in MultivariateStats.jl. I hope someone will tackle this issue in the future.