milesgranger / gap_statistic

Dynamically get the suggested clusters in the data for unsupervised learning.
The Unlicense
217 stars 46 forks

Advanced method to find k + Gap* alternative measure + documentation #28

Closed shaypal5 closed 5 years ago

shaypal5 commented 5 years ago

This pull request is meant to address four issues (three of which I've opened, just for order's sake):

  1. Issue #14, namely implementing step 3 from the original Gap statistic paper, which is the authors' recommended method to choose k. This part is partly based on code found in a fork of this repo by a user named druogury, which implements this step (among other enhancements).
  2. Issue #25, in which I suggested the possible enhancement of also providing the alternative Gap* measure in the resulting gap_df attribute. This is done without affecting the results of the calculation.
  3. Issue #26, in which I suggested adding documentation to the readme file, concisely presenting both the package's API and the method on which it is based.
  4. Issue #27, in which I suggested adding a plotting function implementing the plotting logic found in the example notebook.
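For readers unfamiliar with step 3 of the Gap statistic paper, the selection rule can be sketched roughly as follows: pick the smallest k such that Gap(k) ≥ Gap(k+1) − s_{k+1}. This is an illustrative sketch only, not the code from this pull request. The function name `choose_k` and its arguments are hypothetical, the k values are assumed to be consecutive, and the fallback to the largest Gap value when no k satisfies the rule is an assumption of this sketch, not something the paper prescribes.

```python
from typing import Sequence


def choose_k(k_values: Sequence[int],
             gaps: Sequence[float],
             s_k: Sequence[float]) -> int:
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.

    k_values, gaps and s_k are parallel sequences (hypothetical inputs),
    with k_values assumed to be consecutive integers in increasing order.
    """
    if not (len(k_values) == len(gaps) == len(s_k)) or not k_values:
        raise ValueError("inputs must be non-empty and of equal length")

    # Walk through consecutive pairs (k, k+1) and stop at the first k
    # that satisfies the criterion.
    for i in range(len(k_values) - 1):
        if gaps[i] >= gaps[i + 1] - s_k[i + 1]:
            return k_values[i]

    # Assumption of this sketch: if no k qualifies, fall back to the k
    # with the largest Gap value.
    best = max(range(len(gaps)), key=lambda i: gaps[i])
    return k_values[best]


if __name__ == "__main__":
    ks = [1, 2, 3, 4]
    gaps = [0.1, 0.5, 0.55, 0.4]
    s = [0.05, 0.05, 0.1, 0.05]
    print(choose_k(ks, gaps, s))
```

Note that in the usage above the rule selects a smaller k than simply taking the maximum Gap value would, because the increase from k=2 to k=3 is within one s_{k+1} of noise.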

To this end, this pull request changes exactly three files:

  1. .gitignore - Adds ignore rules for .swp files (for people working with vim, like myself) and for .DS_Store files (for people working on a Mac, like myself).
  2. optimalK.py - Adds:
    • The calculation of measures used for the aforementioned step 3 and Gap* in the OptimalK._calculate_gap() method.
    • Their addition to the gap_df attribute - along with additional calculations which cannot be done in a k-specific function - in the OptimalK.__call__() method.
    • The plotting of four relevant plots (including the Gap-vs-n one from the example notebook) in the newly added OptimalK.plot_results() method.
  3. README.md - Adds documentation for the original suggested method (very basic; more of a mention), the package's API and all new features. This is partial, but a good start, I believe.
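The per-k quantities mentioned in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the body of `OptimalK._calculate_gap()`: the function name `gap_measures`, its arguments, and the dictionary keys are all hypothetical. It assumes `w_k` is the within-cluster dispersion on the real data for a given k and `ref_w` holds the dispersions from B reference datasets. Gap* is taken here to be the same difference as Gap but without the logarithm, which is my understanding of the alternative definition.

```python
import math
from typing import Dict, Sequence


def gap_measures(w_k: float, ref_w: Sequence[float]) -> Dict[str, float]:
    """Compute Gap, sd_k, s_k and Gap* for a single k (illustrative only)."""
    b = len(ref_w)
    if b == 0:
        raise ValueError("need at least one reference dispersion")

    log_ref = [math.log(w) for w in ref_w]
    mean_log_ref = sum(log_ref) / b

    # Gap(k): mean log reference dispersion minus log observed dispersion.
    gap = mean_log_ref - math.log(w_k)

    # sd_k: population standard deviation of the log reference dispersions.
    sd_k = math.sqrt(sum((x - mean_log_ref) ** 2 for x in log_ref) / b)

    # s_k: sd_k inflated to account for simulation error, used in step 3.
    s_k = sd_k * math.sqrt(1.0 + 1.0 / b)

    # Gap*(k): the same difference without the logarithm (assumed definition).
    gap_star = sum(ref_w) / b - w_k

    return {"gap": gap, "sd_k": sd_k, "s_k": s_k, "gap_star": gap_star}


if __name__ == "__main__":
    print(gap_measures(10.0, [18.0, 20.0, 22.0, 20.0]))
```

Quantities such as the difference Gap(k) − Gap(k+1) + s_{k+1} need values for two neighbouring k, which is why they cannot be computed inside a function that only sees a single k.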

Example images are below.

An important note is that while this works great on my machine, I was unable to test it, as it seems the azure-pipelines-based CI solution you recently switched to is not working. I think this should be addressed, but you can also just pull this - this code is really safe, and does nothing that the package doesn't already do.

If I can somehow help with getting testing to work again, I'd love to lend a hand. I would suggest adding travis back as an additional platform to run tests, regardless of what happens with Azure Pipelines; it's free, and you don't have to display the badge, but it can help with development in the meantime.

An example of the enhanced dataframe can be seen here: [image]

And of the new plots here: [four images]

milesgranger commented 5 years ago

Fantastic!

Thank you so much for the PR and for addressing so many issues. This is really great! I'll definitely set aside some time in the near future and review this in detail, but from an initial scan, yes, it seems pretty safe, with good improvements. :100:

As far as Azure Pipelines goes, I've used it to publish the last few releases, and I'm slightly perplexed as to why it has decided to fall over on itself. I've used Travis before on this project and many others, but I wanted to test on all three OSes, and also, I don't really approve of what happened at Travis recently, on a personal level.

Anyhow, I'll poke around AzurePipelines a bit and see what the deal is, otherwise I'll probably merge this in and go with CircleCI and then release a new version.

shaypal5 commented 5 years ago

Hey Miles,

Thanks for the enthusiastic response. :)

I see you made a quick review. I'll go back and fix things before submitting this PR for review again.

And tests are working again, so I'll take a look at the results once they're done.

milesgranger commented 5 years ago

Great!

Also, I see that the trickier failure is in the Rust processing. If you are interested in attempting to add the calculations there as well, the relevant area is around https://github.com/milesgranger/gap_statistic/blob/ab432b7b55d138f7c109ae2084cac19f61734b51/src/gap_statistic/mod.rs#L90. Otherwise, for now, I would be ok with raising an error if the backend is Rust, and before the next release I can add this to the Rust side of things, or maybe even push a commit to this PR if I have time tomorrow.

milesgranger commented 5 years ago

Hi Shay, I put together a patch for the rust stuff that you can apply to your PR.

After addressing the black formatting test failure and the comments, we should be good to go! :+1:

shaypal5 commented 5 years ago

I applied the patch and fixed everything mentioned in your code review, but the rust crate test is still failing everywhere. :(

milesgranger commented 5 years ago

That is completely my fault. After fixing the Rust side I only checked that the Python tests passed. Leaving now for vacation but will have a patch to you this evening. Sorry again.

milesgranger commented 5 years ago

But just from looking, it seems you can just remove line 66 from src/tests/mod.rs and it should pass

milesgranger commented 5 years ago

Because I forgot to implement Debug for the GapCalc struct. That's all.

shaypal5 commented 5 years ago

Trying this just now!

shaypal5 commented 5 years ago

It now seems to fail on Black being angry about optimalK.py's formatting. I've run it through Black and committed again.

shaypal5 commented 5 years ago

Passed!🎉

milesgranger commented 5 years ago

Most excellent! Can you split this into 3-4 commits based on the points this PR was addressing? I'll hopefully merge it in later today.

shaypal5 commented 5 years ago

How do I do this splitting? :)

shaypal5 commented 5 years ago

4 commits or 4 PRs?

milesgranger commented 5 years ago

4 commits. If you have trouble doing that, I don't have any strong feelings against squashing and merging as one commit, just a preference to split the commits per issue.

milesgranger commented 5 years ago

Put it this way: if you don't do it by this evening I'll squash and merge it. 😉👍 But in general, I don't know how comfortable you are with git. One would typically do a soft reset to master and then stage and commit the chunks as they pertain to the given issues. 👍
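The soft-reset procedure described above can be sketched on a throwaway repository. This is only an illustration under stated assumptions: it requires the `git` executable on the PATH, the helper names `git` and `split_demo`, the file contents, and the commit messages are all made up, and it drives git through Python's `subprocess` purely so the steps are reproducible. In day-to-day use one would type the same `git reset --soft master`, `git reset`, `git add` and `git commit` commands directly in a shell.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def git(repo: Path, *args: str) -> str:
    """Run a git command in `repo` with a throwaway identity."""
    cmd = ["git",
           "-c", "user.name=Example",
           "-c", "user.email=example@example.com",
           "-c", "commit.gpgsign=false",
           *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def split_demo() -> List[str]:
    """Turn one big commit on a feature branch into one commit per topic."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")
        # Make sure the first branch is called master on any git version.
        git(repo, "symbolic-ref", "HEAD", "refs/heads/master")
        (repo / "README.md").write_text("base\n")
        git(repo, "add", "README.md")
        git(repo, "commit", "-m", "Initial commit")

        # A feature branch with everything lumped into a single commit.
        git(repo, "checkout", "-b", "feature")
        (repo / "README.md").write_text("base\ndocs\n")
        (repo / "optimalK.py").write_text("# new code\n")
        git(repo, "add", "-A")
        git(repo, "commit", "-m", "Everything at once")

        # Soft reset: the branch pointer moves back to master, while all
        # changes stay in the working tree and the index.
        git(repo, "reset", "--soft", "master")
        # Unstage everything so each topic can be staged separately.
        git(repo, "reset")

        git(repo, "add", "optimalK.py")
        git(repo, "commit", "-m", "Add new calculations")
        git(repo, "add", "README.md")
        git(repo, "commit", "-m", "Document the API")

        # Commit subjects on feature that are not on master, oldest first.
        log = git(repo, "log", "--reverse", "--format=%s", "master..feature")
        return log.splitlines()


if __name__ == "__main__":
    print(split_demo())
```

When several topics touch the same file, `git add -p` lets one stage individual hunks instead of whole files. Because the branch history is rewritten, updating an already pushed PR branch afterwards needs a force push.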

shaypal5 commented 5 years ago

Sorry, I was off the grid for a couple of days! :)

I don't know how to perform the procedure you described, but I would've learned. Never mind. :)

Thank you for all the support with this PR!

milesgranger commented 5 years ago

No worries, I'm on vacation and didn't want to keep you waiting. Had some free time last night so all taken care of! Thanks again. 👌