paultopia / quantitative-methods-for-lawyers

Introduction to Quantitative and Computational Legal Reasoning

Plotting library? #8

Open paultopia opened 5 years ago

paultopia commented 5 years ago

Suggestions wanted: what Python plotting library should this class use? Exploratory visualization is going to be a major theme of this course, and I intend to use it to provide concrete statistical instruction as well. For example, my current plan is to build out understanding of statistical distributions via histograms (i.e., get students to produce histograms of simulated data for a normal, see how it changes with different parameters, then ditto with other distributions, understand z-scores and p-values by manipulating the normal histograms, etc.).
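To make the histogram plan concrete, here is a minimal sketch in raw matplotlib and NumPy. The helper names `simulate_normal` and `plot_normal_histograms` are hypothetical (they come from none of the libraries discussed below), and the parameter values are arbitrary examples.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in a notebook
import matplotlib.pyplot as plt


def simulate_normal(mean, sd, n=10_000, seed=0):
    """Draw n simulated values from a normal distribution."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mean, scale=sd, size=n)


def plot_normal_histograms(params, bins=40):
    """Overlay one histogram per (mean, sd) pair on shared axes."""
    fig, ax = plt.subplots()
    for mean, sd in params:
        ax.hist(simulate_normal(mean, sd), bins=bins, alpha=0.5,
                density=True, label=f"mean={mean}, sd={sd}")
    ax.set_xlabel("simulated value")
    ax.set_ylabel("density")
    ax.legend()
    return fig, ax


# Same mean with a wider spread, then a shifted mean, to compare shapes.
fig, ax = plot_normal_histograms([(0, 1), (0, 2), (3, 1)])
```

Changing the `(mean, sd)` pairs and re-running is the kind of exercise described above; the same pattern should carry over to other distributions by swapping the sampling call.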

Options:

Seaborn

Positives: easy to produce attractive, interesting plots; popular, so there's tons of info about how to use it.

Negatives: the API seems to have changed a lot in the last major version, and the version that comes pre-installed on Azure notebooks is obsolete; I had a kind of ugly experience (https://github.com/paultopia/quantitative-methods-for-lawyers/blob/master/lessons/tbd/gotcha.ipynb) trying to make sample code work. And the point of using a single cloud environment is to protect students from this kind of ugliness.

Plotly

Positives: dead simple. Produces interactive plots. Those plots are attractive.

Negatives: not terribly customizable, so it might be hard to build out incremental teaching material using it. I'm not sure if there's a way to, for example, put vertical lines and shading at different percentiles of a plot to indicate where standard p-value cutoff points are. Also, it's fundamentally a commercial product and it feels designed to nudge people into paid accounts.

Bokeh

Positives: also pretty simple, also interactive, albeit less attractive.

Negatives: I'm not sure how customizable it is, mainly because I don't personally know the library all that well. I'd have to dig in a good bit, and that would be time-consuming. It's not matplotlib-based, I believe, which potentially limits the more powerful options.

Raw matplotlib

Positives: you can do anything.

Negatives: It's stupidly difficult.

Pandas built-in plotting

Positive: it's a shockingly good plotting library on its own, and probably customizable because it just wraps matplotlib.

Negative: None that I can think of. It's Pandas.
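As a rough illustration of the "it just wraps matplotlib" point: pandas plotting methods return a matplotlib `Axes`, so ordinary matplotlib calls can be layered on top. The data and the 95th-percentile cutoff below are made-up examples, not course material.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in a notebook

# Simulated data standing in for whatever dataset the class uses (an assumption).
rng = np.random.default_rng(42)
scores = pd.Series(rng.normal(loc=100, scale=15, size=5_000), name="score")

# Series.plot.hist returns a matplotlib Axes object...
ax = scores.plot.hist(bins=40, alpha=0.6, title="Simulated scores")

# ...so plain matplotlib methods can mark and shade a percentile on it.
p95 = scores.quantile(0.95)
ax.axvline(p95, color="black", linestyle="--", label="95th percentile")
ax.axvspan(p95, scores.max(), color="red", alpha=0.2)
ax.legend()
```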

Altair

Positives: Not sure, I don't know it really.

Negatives: not all dependencies are installed from the get-go. I don't know it really. Probably shouldn't try to use a library I have never used when visualization is a core part of the class.

Ggplot

Positives: I guess ggplot-style stuff is kind of an industry standard?

Negatives: I kind of hate the grammar of graphics. There are a bunch of different Python ports of ggplot (though I've heard good things about plotnine).

Make my own damn library

Positives: total control. I can make sure students have the same version by only having one version available. I can just build in simple functions to generate every single thing I want. I should probably have a course library anyway, as this seems to be a fairly effective teaching method for incremental learning (e.g., the fast.ai library wrapping PyTorch, or the How to Design Programs class with its Racket learning languages; both of those courses are legendarily good). I already have a minimal plotting library (https://github.com/paultopia/plottyprint) built in raw matplotlib that I can build off of.

Disadvantages: a ton of work. It doesn't completely avoid the problem of obsolete/wonky versions of things, since matplotlib versions tend to be problematic. I can't imagine what it's like trying to upgrade matplotlib on Azure notebooks, so I might have to pin this library to whatever version of matplotlib is on there, which is likely to be incredibly annoying and inconvenient. Though, who knows, maybe Azure comes with a modern version of matplotlib? We should be so lucky... (Actually, I just checked: it seems to come with 3.0.0, which is probably recent enough, and plottyprint works, so there's that!)

I'm leaning toward the own-library solution, with pandas as a backup, but thoughts?

warrenagin commented 5 years ago

Take a look at Pixiedust from IBM Watson. It creates an in-notebook dashboard to do visualizations. Warren Agin

paultopia commented 5 years ago

Oh yeah, I think we looked at this before! Thanks!


Paul Gowder https://gowder.io

My book: The Rule of Law in the Real World http://rulelaw.net

On Nov 24, 2018, at 4:16 PM, Warren Agin notifications@github.com wrote:

Take a look at Pixiedust from IBM Watson. It creates an in-notebook dashboard to do visualizations. Warren Agin

WithPrecedent commented 5 years ago

I would go with seaborn. The graphs are pretty (which matters for any future applications of law students). It fixes much of the messy API of matplotlib and it's already included with Anaconda. I looked at your bad experience and can't quite make sense of what happened. My limited experience with seaborn has been pretty good (especially compared to ggplot in R). I agree that pandas plot is a decent alternative if you aren't looking at highly customized graphs. And I assume you are teaching a lot of pandas anyway. So, it's a decent second choice, imo. I definitely wouldn't use bokeh because it is too strongly directed at interactive, online graphs. Creating your own library seems insane in terms of opportunity cost.

paultopia commented 5 years ago

What happened in the linked file is that the seaborn that Azure notebooks had was 0.8.0, but the version that all the existing documentation is about was 0.9.0, and the API changed. It's also kind of obnoxious to get Jupyter to reload libraries after upgrading. So one advantage of having my own library is that I can just specify exact dependencies in it, wrap the API of everything, and then not stress about such things.
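One lightweight way to catch this kind of mismatch early is a version check at the top of a notebook. The sketch below uses only the standard library; the helper names (`parse_version`, `is_older`, `check_minimum`) are hypothetical, and the version parsing is deliberately simplistic (leading digits of each dotted part only), so treat it as an illustration rather than a full version parser.

```python
import itertools
from importlib import metadata


def parse_version(text):
    """Turn '0.9.0' into (0, 9, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older(found, minimum):
    """True if version string `found` sorts before `minimum` (zero-padded compare)."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(a) < pad(b)


def check_minimum(package, minimum):
    """Describe whether the installed `package` meets the version the lessons assume."""
    try:
        found = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if is_older(found, minimum):
        return f"{package} {found} is older than the {minimum} the lessons assume"
    return f"{package} {found} is OK"


print(check_minimum("seaborn", "0.9.0"))
```

This only reports the problem; after an actual upgrade, restarting the notebook kernel is usually the simplest way to make sure the new version is the one that gets imported.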

The real question might be how much obnoxiousness I want to subject my students to, versus opportunity cost. The main advantage of my own library is the ability to integrate visualization really closely with teaching, like having a version of seaborn's kdeplot with horizontal lines and shadings for different quantiles, so they can have a visual reference for things like z-scores. But that might not be worth it, effort-wise, it's true.
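A rough sketch of what such a teaching-oriented helper could look like follows. It is not plottyprint's API or seaborn's `kdeplot`: the function name `plot_normal_with_tails` is hypothetical, it draws the exact normal density rather than a kernel density estimate, and it marks the cutoffs with vertical lines on the x-axis, which is an assumption about what the quantile markers are meant to show.

```python
from statistics import NormalDist

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; drop this line in a notebook
import matplotlib.pyplot as plt


def plot_normal_with_tails(alpha=0.05, mean=0.0, sd=1.0):
    """Plot a normal density, mark the two-sided cutoffs for alpha, shade the tails."""
    dist = NormalDist(mean, sd)
    lower = dist.inv_cdf(alpha / 2)
    upper = dist.inv_cdf(1 - alpha / 2)

    # Evaluate the normal density on a grid covering four standard deviations.
    x = np.linspace(mean - 4 * sd, mean + 4 * sd, 400)
    y = np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

    fig, ax = plt.subplots()
    ax.plot(x, y)
    for cutoff in (lower, upper):
        ax.axvline(cutoff, linestyle="--", color="black")
    # Shade the rejection region in both tails.
    ax.fill_between(x, y, where=(x <= lower) | (x >= upper), alpha=0.3, color="red")
    ax.set_title(f"Two-sided cutoffs for alpha = {alpha}")
    return ax, (lower, upper)


ax, (lower, upper) = plot_normal_with_tails(alpha=0.05)
```

With the default standard normal, the returned cutoffs are the familiar z-scores of roughly -1.96 and 1.96, so students can connect the shaded area to the 0.05 threshold directly.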

Currently, I'm again swearing about using Python, as I'm on hour 3 of an attempt to replicate the Azure notebook environment on my home computer with pipenv, and have run into every. single. possible. dependency. error. Sigh. There's still time to switch to R...

warrenagin commented 5 years ago

The problem with building your own library is that you are then teaching a skill that doesn't transfer. -Warren Agin
