heal-research / HeuristicLab

HeuristicLab - An environment for heuristic and evolutionary optimization
https://dev.heuristiclab.com
GNU General Public License v3.0
36 stars 16 forks source link

t-Distributed Stochastic Neighbor Embedding #2700

Closed HeuristicLab-Trac-Bot closed 7 years ago

HeuristicLab-Trac-Bot commented 7 years ago

Issue migrated from trac ticket # 2700

milestone: HeuristicLab 3.3.15 | component: Algorithms.DataAnalysis | priority: medium | resolution: done

2016-11-11 13:14:41: @BernhardWerth created the issue


t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets and should be avialable in HeuristicLab

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-11 13:14:55: @BernhardWerth changed status from new to assigned

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-11 13:14:55: @BernhardWerth set owner to @BernhardWerth

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-11 13:16:00: @BernhardWerth changed type from defect to feature request

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-11 13:16:59: @BernhardWerth changed status from assigned to accepted

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-11 14:33:53: @BernhardWerth commented


r14387 created initial branch

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-25 11:00:32: @BernhardWerth commented


r14413 refactored c++ style code to C# (use of [,] arrays, int vs uint,..) + corrected IterationsCounter

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-25 11:00:32: @BernhardWerth

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-25 11:01:01: @BernhardWerth commented


r14414 forgot to add files

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-25 11:01:22: @BernhardWerth changed status from accepted to reviewing

HeuristicLab-Trac-Bot commented 7 years ago

2016-11-25 11:01:22: @BernhardWerth changed owner from @BernhardWerth to @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 13:56:23: @gkronber commented


Comments from an initial review:

  • It should be possible to stop the algorithm at any time
  • A quality line chart for the error would probably be interesting
  • It would be nice to be able to view the projection after each iteration
  • The descriptions for parameters should contain information on default settings or useful settings for the parameters.
  • Is it necessary to tune all the parameters for learning? Or would it also be ok to just use some robust default settings and hide most of the parameters (except for perplexity)
  • I think it is not strictly necessary that TSNE derives from Item (since it is probably never used directly in the GUI)
  • Error message: "Perplexity should be lower than K" what's K?

Let's discuss this in person...

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 13:56:23: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 13:56:23: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 13:56:23: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 13:56:23: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 17:19:02: @gkronber changed status from reviewing to assigned

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-18 17:19:02: @gkronber changed owner from @gkronber to @BernhardWerth

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-19 07:51:09: @gkronber commented


r14503: minor change while reviewing

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-20 15:50:25: @BernhardWerth commented


r14512 worked in several comments from mkommend, extended analysis during algorithm run, added more Distances, made algorithm stoppable

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber commented


More observations:

  • TSNE should be a BasicAlgorithm
  • Exception when switching between views (projected data & quality line chart) while the algorithm is running
  • r14512 added references to files for a kernel PCA in the project file (please remove).
  • Why does the error change abruptly when the 'stop-lying-iteration' is reached? (--> OK)
  • Hide parameters: *Momentum, Eta, MomentumSwitch, StopLying. Set StopLying to zero per default.
HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-21 12:06:56: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2016-12-22 10:09:39: @BernhardWerth commented


r14518 TSNEAnalysis is now a BasicAlg, hid Parameters, added optional data normalization to make TSNE scaling-invariant

HeuristicLab-Trac-Bot commented 7 years ago

2017-01-11 07:38:52: @BernhardWerth commented


r14558 made TSNE compatible with the new pausible BasicAlgs, removed rescaling of scatterplots during alg to give it a more movie-esque feel

HeuristicLab-Trac-Bot commented 7 years ago

2017-02-17 13:02:59: @abeham commented


r14682: fixed references for alglib and libsvm

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-10 08:38:11: @BernhardWerth changed status from assigned to reviewing

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-10 08:38:11: @BernhardWerth changed owner from @BernhardWerth to @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-10 08:38:11: @BernhardWerth commented


r14742 fixed displaying of randomly generated seed and some minor code simplifications

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-19 18:19:06: @gkronber commented


The 'performance-improved' distance methods which also accept a threshold seem to be implemented incorrectly. However, they are not used by tSNE anyway so I'm removing them.

 sum = 0;
 ...
 while(sum > threshold ...) {
   sum += ...
 }
 return sum;
HeuristicLab-Trac-Bot commented 7 years ago

2017-03-19 18:37:19: @gkronber commented


VPTree contains a TODO item //TODO check if minheap or maxheap should be used here

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-19 18:47:08: @gkronber commented


r14767: made some changes while reviewing the code

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 11:16:47: @gkronber commented


ABE: fixed revision numbers

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 11:16:47: @abeham

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 12:13:00: @abeham commented


What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 12:22:37: @gkronber commented


Replying to [comment:21 abeham]:

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset as for all other data-analysis algs).

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

I also already noticed that theta=0 produces very different results. We should leave this option for the case of N>5000 but use theta=0 as default. Additionally, we should look at the differences between theta=0 and theta=eps, maybe there is another issue hidden there.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

Full ack. I plan to refactor the code to provide a simple static method.

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 12:35:15: @gkronber commented


r14784: fixed build fail

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 12:50:42: @abeham commented


For comparison purposes, there's also a JavaScript implementation of TSNE (and probably some more).

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 13:36:13: @gkronber commented


TODO:

  • Add parameter to update results only X iterations (done)
HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 13:36:13: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 15:15:33: @gkronber commented


r14785: changes and while reviewing

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 15:29:13: @gkronber commented


r14787: more changes while reviewing

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-27 17:27:14: @gkronber commented


r14788: refactoring

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-30 18:09:03: @gkronber commented


r14806: worked on tSNE, storable and cloning for tSNE state. Added some TODO comments while reviewing.

HeuristicLab-Trac-Bot commented 7 years ago

2017-03-30 19:07:02: @gkronber commented


r14807: support clone and persistence and pause/resume

HeuristicLab-Trac-Bot commented 7 years ago

2017-04-10 15:47:41: @gkronber commented


Review comments:

  • in fast_tsne.m an initial dimensionality reduction using PCA is performed before running bh_tsne features is not absolutely necessary
  • normalization of data should be moved to TSNEStatic (ZeroMean and diving by max) this is not possible as the tSNE implementation is not specific to real vectors. Therefore, scaling was left in the main algorithm
HeuristicLab-Trac-Bot commented 7 years ago

2017-04-10 15:47:41: @gkronber

HeuristicLab-Trac-Bot commented 7 years ago

2017-04-10 15:47:41: @gkronber