kwikteam / klustakwik2

Fast software for high-dimensional cluster analysis using the masked EM algorithm for Gaussians mixtures
BSD 3-Clause "New" or "Revised" License
27 stars 13 forks source link

algorithm can only drop one cluster per iteration? #47

Closed nsteinme closed 9 years ago

nsteinme commented 9 years ago

E.g. see snippet of run-time output below. Not sure if this is related to the "consider cluster deletions" thing or whether it is expected/appropriate behavior or what. I have definitely seen it before - I think with KK1 also - but I figured it was worth putting as an issue now in case we are considering this "start with a million clusters" idea. That will be impractical unless it can more quickly pare down the number of clusters.

INFO klustakwik: Iteration 97F: 811 clusters, 3443 changed, score=-1583280187.322291 (decreased by -44870.075064) INFO klustakwik: Deleting cluster 666 (960 points): improves score by 58087.3913987 INFO klustakwik: Iteration 98F: 810 clusters, 3420 changed, score=-1583346519.560857 (decreased by 66332.238566) INFO klustakwik: Deleting cluster 659 (1771 points): improves score by 60677.0776336 INFO klustakwik: Iteration 99F: 809 clusters, 4263 changed, score=-1583349330.125175 (decreased by 2810.564317) INFO klustakwik: Deleting cluster 439 (12 points): improves score by 2821.16777563 INFO klustakwik: Iteration 100F: 808 clusters, 3025 changed, score=-1583296095.278437 (decreased by -53234.846738) INFO klustakwik: Deleting cluster 532 (1 points): improves score by 245656.853454 INFO klustakwik: Iteration 101F: 807 clusters, 2376 changed, score=-1583301681.809656 (decreased by 5586.531219) INFO klustakwik: Deleting cluster 452 (251 points): improves score by 64193.7042065 INFO klustakwik: Iteration 102F: 806 clusters, 2348 changed, score=-1583368975.173469 (decreased by 67293.363812) INFO klustakwik: Deleting cluster 463 (392 points): improves score by 2748.25345516 INFO klustakwik: Iteration 103F: 805 clusters, 2468 changed, score=-1583306184.532981 (decreased by -62790.640488) INFO klustakwik: Deleting cluster 291 (302 points): improves score by 2732.28269553 INFO klustakwik: Iteration 104F: 804 clusters, 2223 changed, score=-1583311165.896126 (decreased by 4981.363145) INFO klustakwik: Deleting cluster 727 (23 points): improves score by 2699.65647888 INFO klustakwik: Iteration 105F: 803 clusters, 1852 changed, score=-1583316484.261275 (decreased by 5318.365149) INFO klustakwik: Deleting cluster 412 (1097 points): improves score by 62076.7187171 INFO klustakwik: Iteration 106F: 802 clusters, 2665 changed, score=-1583380019.370767 (decreased by 63535.109492) INFO klustakwik: Deleting cluster 130 (2206 points): improves score by 45009.4978437 INFO klustakwik: Iteration 107F: 801 clusters, 3949 changed, score=-1583360666.323571 (decreased by -19353.047196) INFO klustakwik: Deleting cluster 446 (843 points): improves score by 67521.9796369 INFO klustakwik: Iteration 108F: 800 clusters, 3614 changed, score=-1583380789.840912 (decreased by 20123.517341) INFO klustakwik: Deleting cluster 667 (4438 points): improves score by 48945.2908864 INFO klustakwik: Iteration 109F: 799 clusters, 6734 changed, score=-1583369644.089056 (decreased by -11145.751856) INFO klustakwik: Deleting cluster 626 (2204 points): improves score by 24971.3849628 INFO klustakwik: Iteration 110F: 798 clusters, 5907 changed, score=-1583344573.970613 (decreased by -25070.118443) INFO klustakwik: Deleting cluster 679 (1235 points): improves score by 58692.3001728 INFO klustakwik: Iteration 111F: 797 clusters, 5161 changed, score=-1583373783.903421 (decreased by 29209.932808) INFO klustakwik: Deleting cluster 267 (1283 points): improves score by 63954.8479295 INFO klustakwik: Iteration 112F: 796 clusters, 4618 changed, score=-1583383984.047065 (decreased by 10200.143644) INFO klustakwik: Deleting cluster 706 (2870 points): improves score by 55832.1068685 INFO klustakwik: Iteration 113F: 795 clusters, 5810 changed, score=-1583377023.057128 (decreased by -6960.989937) INFO klustakwik: Deleting cluster 697 (1828 points): improves score by 42014.7501283 INFO klustakwik: Iteration 114F: 794 clusters, 5223 changed, score=-1583362039.470591 (decreased by -14983.586538) INFO klustakwik: Deleting cluster 606 (1564 points): improves score by 51588.2478426 INFO klustakwik: Iteration 115F: 793 clusters, 5057 changed, score=-1583365973.112942 (decreased by 3933.642351) INFO klustakwik: Deleting cluster 663 (1199 points): improves score by 66282.8353379 INFO klustakwik: Iteration 116F: 792 clusters, 4870 changed, score=-1583379658.609888 (decreased by 13685.496947) INFO klustakwik: Deleting cluster 677 (661 points): improves score by 2724.29611254 INFO klustakwik: Iteration 117F: 791 clusters, 4039 changed, score=-1583388516.660150 (decreased by 8858.050262) INFO klustakwik: Deleting cluster 145 (5261 points): improves score by 42458.2782378 INFO klustakwik: Iteration 118F: 790 clusters, 8445 changed, score=-1583365603.007160 (decreased by -22913.652990) INFO klustakwik: Deleting cluster 710 (3695 points): improves score by 19257.2251148 INFO klustakwik: Iteration 119F: 789 clusters, 8578 changed, score=-1583338145.870308 (decreased by -27457.136852) INFO klustakwik: Deleting cluster 113 (2563 points): improves score by 59295.3662486 INFO klustakwik: Iteration 120F: 788 clusters, 8728 changed, score=-1583374565.313740 (decreased by 36419.443432) INFO klustakwik: Trying to split clusters INFO klustakwik: Split into 799 clusters INFO klustakwik: Deleting cluster 480 (2555 points): improves score by 12848.5362978 INFO klustakwik: Iteration 121F: 799 clusters, 10308 changed, score=-1583335287.944449 (decreased by -39277.369292) INFO klustakwik: Deleting cluster 796 (825 points): improves score by 66476.266953 INFO klustakwik: Iteration 122F: 798 clusters, 7463 changed, score=-1583387259.942601 (decreased by 51971.998153) INFO klustakwik: Deleting cluster 526 (1203 points): improves score by 43914.0074992 INFO klustakwik: Iteration 123F: 797 clusters, 5957 changed, score=-1583373654.598751 (decreased by -13605.343850) INFO klustakwik: Deleting cluster 307 (1281 points): improves score by 63485.0466793 INFO klustakwik: Iteration 124F: 796 clusters, 5903 changed, score=-1583387250.780976 (decreased by 13596.182225) INFO klustakwik: Deleting cluster 239 (4 points): improves score by 63907.3052289 INFO klustakwik: Iteration 125F: 795 clusters, 4411 changed, score=-1583330686.296948 (decreased by -56564.484027)

thesamovar commented 9 years ago

This is unfortunately unavoidable with the new low-memory KK2 design, although we should indeed be aware of the issue.

nsteinme commented 9 years ago

Ok, no problem. It just makes starting with one cluster per unique mask impossible, I think...

thesamovar commented 9 years ago

Not necessarily, because a cluster can get emptied by all the points migrating away from it without it being deleted. I guess we'll see when we try this strategy.

nsteinme commented 9 years ago

I see, so you're saying for me to be seeing one cluster lost per iteration, these clusters are not being lost in this emptied-out way, but through a different route - and if they were emptied out, I'd get more than one lost per iteration. Sounds good, got it.

On Mon, Jun 1, 2015 at 5:30 PM, Dan Goodman notifications@github.com wrote:

Not necessarily, because a cluster can get emptied by all the points migrating away from it without it being deleted. I guess we'll see when we try this strategy.

— Reply to this email directly or view it on GitHub https://github.com/kwikteam/klustakwik2/issues/47#issuecomment-107627283 .

shabnamkadir commented 9 years ago

I had tried this approach early on when developing mask starts - initialising with the unique masks . It worked very well and the error rates were very robust. I stopped using it for larger dataset due to RAM issues. Now however, with so many savings, it is certainly worth trying again and faster than going through many TrySplits.

thesamovar commented 9 years ago

Well as soon as someone is willing to put the time in, the idea can already be tested in the branch combine_m_ec in which RAM usage doesn't depend on the number of clusters.

nsteinme commented 9 years ago

So do you think trysplits is completely unnecessary, then, and should be skipped? Trysplits isn't going to go any faster if it just doesn't have any splitting to do, will it?

On Mon, Jun 1, 2015 at 5:57 PM, Dan Goodman notifications@github.com wrote:

Well as soon as someone is willing to put the time in, the idea can already be tested in the branch combine_m_ec in which RAM usage doesn't depend on the number of clusters.

— Reply to this email directly or view it on GitHub https://github.com/kwikteam/klustakwik2/issues/47#issuecomment-107637063 .

thesamovar commented 9 years ago

It could be the case that with a much larger number of starting clusters that splitting will be less necessary. My guess: we'll still want to run it but maybe less often (e.g. maybe after 100 or 150 iterations instead of every 40). Difficult to tell until we try it.