memphis-iis / datawhys-content-notebooks-python

Content for DataWhys in the form of JupyterLab notebooks (.ipynb files)
Apache License 2.0

Notebook: Logistic regression #10

Closed aolney closed 4 years ago

aolney commented 4 years ago

See the spreadsheet for details

Content: VR | Programming: VR

Ideas/prereqs: Binomial distribution (brief mention of distributions in general), Regression vs classification, confusion matrix, accuracy, precision/recall

Direct link https://jupyter.olney.ai/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fmemphis-iis%2Fdatawhys-content-notebooks&subPath=Logistic-regression.ipynb&app=lab

vasilerus commented 4 years ago

I added the theory part. This should now be done.

aolney commented 4 years ago

Great, thanks :) Apologies, I've been focusing on the ones that were closer, so I doubly appreciate you being proactive on this.

aolney commented 4 years ago

Reopening to help me track your latest changes

aolney commented 4 years ago

Some thoughts on https://github.com/memphis-iis/datawhys-content-notebooks/pull/49

vasilerus commented 4 years ago

The role of the histograms is to build a general habit of looking at the distribution of values for each predictor/feature. It's always a good idea to do that so you notice anything unusual, such as outliers.
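
As a rough illustration of that habit, here is a minimal standard-library sketch. The feature values are made up, the helper names `bin_counts` and `iqr_outliers` are hypothetical, and the 1.5 x IQR rule is just one common convention for flagging outliers, not something the notebook prescribes. In the notebook itself a pandas `DataFrame.hist()` call would be the more natural tool.

```python
import statistics

def bin_counts(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Made-up feature values, not taken from the notebook's dataset
feature = [21, 22, 23, 24, 25, 26, 27, 28, 29, 95]

# A crude text histogram: one '#' per observation in each bin
for i, count in enumerate(bin_counts(feature)):
    print(f"bin {i}: {'#' * count}")

print("possible outliers:", iqr_outliers(feature))
```

A long empty stretch between the first and last bins is the kind of thing the histogram is meant to make visible before any modeling starts.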

On Wed, Jun 17, 2020 at 1:05 AM Andrew M Olney notifications@github.com wrote:

Some thoughts on #49 https://github.com/memphis-iis/datawhys-content-notebooks/pull/49

  • Updated to remove material already covered in previous days
  • Not sure how to articulate purpose of histograms, since the analysis does not seem to adjust to them
  • Would probably be nice to add odds ratio interpretation of the coefficients, but I'm out of gas for today


aolney commented 4 years ago

Totally agree; this is something I do routinely myself. But I wasn't sure how you wanted to frame it in this specific notebook, for both the histograms and the correlation matrix. Here are some thoughts:

  1. General data integrity
  2. Possible transformations of data
  3. Discarding variables

Right now it seems more aligned with 1 than the others. If we wanted to enhance their understanding of the effect of replacing missing values with the median, we could try before/after comparison plots. With regard to 2, we could use it as an opportunity to introduce transformations like log or sqrt. With regard to 3, we could add discussion about looking at which variables correlate strongly with the class label (do any? I don't recall that) and with each other (none).
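
To make those three framings concrete, here is a small standard-library sketch under stated assumptions: the data and labels are invented, and `impute_median`, `summarize`, and `pearson` are hypothetical helper names rather than anything in the notebook. It compares summary statistics before and after median imputation, applies a log transform, and computes the Pearson correlation of a feature with a 0/1 class label (which is the point-biserial correlation).

```python
import math
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def summarize(values):
    """Mean, median, and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented feature with missing values and an invented binary class label
raw = [1.0, 2.0, None, 4.0, None, 100.0]
label = [0, 0, 0, 1, 0, 1]

# 1. Data integrity: compare the observed values with the imputed column
observed = [v for v in raw if v is not None]
filled = impute_median(raw)
print("before imputation:", summarize(observed))
print("after imputation: ", summarize(filled))

# 2. Transformation: a log transform pulls in the long right tail
logged = [math.log(v) for v in filled]
print("after log:        ", summarize(logged))

# 3. Discarding variables: how strongly does the feature track the label?
print("corr(feature, label):     ", pearson(filled, label))
print("corr(log feature, label): ", pearson(logged, label))
```

Median imputation leaves the median unchanged but shrinks the spread, which is the kind of effect before/after plots would show visually.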

aolney commented 4 years ago

Closing for now to clear the board with https://github.com/memphis-iis/datawhys-content-notebooks/commit/749c6dd447f8fc177067de5c3469676cf7adacc8

We can continue the discussion even though it's closed :smile:

One thing I added to the PM notebook is interpreting the coefficients.
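
For the odds ratio interpretation mentioned earlier in the thread, a minimal sketch might look like the following. The coefficient names and values are invented for illustration and do not come from the notebook, and `odds_ratios` is a hypothetical helper. The only fact relied on is that a logistic regression coefficient is a change in log-odds, so exponentiating it gives the multiplicative change in odds per one-unit increase in that predictor, holding the others fixed.

```python
import math

# Hypothetical fitted coefficients (log-odds per one-unit increase);
# names and values are made up, not taken from the notebook.
coefficients = {"feature_a": 0.04, "feature_b": 0.09, "feature_c": -0.01}

def odds_ratios(coefs):
    """Exponentiate each log-odds coefficient to get an odds ratio."""
    return {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios(coefficients).items():
    change = (ratio - 1) * 100  # percent change in the odds per unit increase
    print(f"{name}: odds ratio {ratio:.3f} ({change:+.1f}% change in odds per unit)")
```

An odds ratio above 1 means the odds of the positive class go up as the predictor increases, below 1 means they go down, and exactly 1 (a coefficient of 0) means no change.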