merely-useful / py-rse

Research Software Engineering with Python course material
http://third-bit.com/py-rse/

Good examples and data sets #14

Closed gvwilson closed 5 years ago

gvwilson commented 5 years ago

Please add notes to this issue pointing at good data sets and/or describing good examples or exercises that you would like to see in the lessons (at any level).

ChristinaLK commented 5 years ago

One of my local colleagues did a Python plotting lesson on CO2 data measured from Mauna Loa, showing both the annual flux of carbon in the atmosphere and the trend of rising carbon over time.

His repository is here: https://github.com/megarcia/SWC_Python

DamienIrving commented 5 years ago

@ChristinaLK We could possibly make the CO2 analysis story a little more interesting by looking at data from all three of the Premier Global Baseline Stations (Alert, Mauna Loa and Cape Grim).

The cool thing is that, because greenhouse gases have a long residence time in the atmosphere (~100 years), they are more or less uniformly distributed around the globe, so you get a similar result no matter where you measure from (provided your sample isn't contaminated by nearby emission sources, which is why the premier stations are in remote places). There is more vegetation in the Northern Hemisphere, which is responsible for the annual cycle you mention: the planet essentially breathes out during boreal autumn and in during boreal spring. So there will be some subtle differences between the monthly Cape Grim timeseries and the Alert / Mauna Loa timeseries that we could draw out during the lessons (Cape Grim will have a smaller seasonal cycle).

I guess the only potential downside is that the data won't require much cleaning up. There would be some work involved in merging the raw data files from all three sites into a consistent format and there are missing values for some months early in the record, but that's about it as far as cleaning goes.
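As a rough illustration of the merging step described above, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the values are synthetic (not real station data), the amplitudes are invented so that the Southern Hemisphere site has the smallest cycle, and the `MISSING` marker, the site keys and the helper names are hypothetical rather than taken from any actual file format.

```python
import math

# Hypothetical missing-value marker; real station files may use something else.
MISSING = -999.99


def synthetic_series(amplitude, n_months=36, start=340.0, trend_per_month=0.15):
    """Make fake monthly CO2 values (ppm): linear trend plus an annual cycle."""
    return [
        start + trend_per_month * i + amplitude * math.sin(2 * math.pi * i / 12)
        for i in range(n_months)
    ]


def merge_sites(raw_by_site, first_year=1980):
    """Combine per-site lists into rows of (year, month, site, value)."""
    rows = []
    for site, values in raw_by_site.items():
        for i, value in enumerate(values):
            year, month = first_year + i // 12, i % 12 + 1
            # Turn the missing-value marker into None so it cannot skew statistics.
            rows.append((year, month, site, None if value == MISSING else value))
    return rows


def seasonal_range(rows, site, year):
    """Peak-to-trough range for one site in one year, skipping gaps.

    No detrending is done, so the range also includes that year's trend.
    """
    values = [v for (y, _, s, v) in rows if s == site and y == year and v is not None]
    return max(values) - min(values)


# Invented amplitudes, chosen only to mimic a smaller cycle at the southern site.
raw = {
    "alert": synthetic_series(amplitude=7.0),
    "mauna_loa": synthetic_series(amplitude=3.0),
    "cape_grim": synthetic_series(amplitude=0.5),
}
raw["alert"][2] = MISSING  # simulate a gap early in the record

rows = merge_sites(raw)
ranges = {site: seasonal_range(rows, site, 1981) for site in raw}
```

Storing one row per (year, month, site) keeps the three records in a single consistent shape, which makes per-site comparisons like `ranges` a simple filter.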

gvwilson commented 5 years ago

I think our learners will thank us if we show them how to manipulate clean data first and only then dive into the messiness of reality :-) The climate data would be very cool.

mbonsma commented 5 years ago

@joelostblom mentioned this I believe, but here's what I found when I searched 'AirBnB data': http://insideairbnb.com/get-the-data.html. As far as I understand, it's a collection of publicly available listing data scraped from AirBnB's website. Here's a visualization from the website for Toronto, for example.

Pros:

Cons:

cwickham commented 5 years ago

In a similar vein to the CO2 data, I've adapted parts of the chapter "Revealing Change" in Alberto Cairo's "The Truthful Art: Data, Charts, and Maps for Communication" to US employment data.

I say similar to the CO2 data, since it's also a time series where the salient features include trend, seasonal patterns, and noise.
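To make the trend / seasonal / noise split concrete, here is a minimal sketch on a synthetic monthly series. It is not the method from Cairo's chapter or from the class material; it assumes a simple additive model and uses a centered 12-month moving average as the trend estimate. The function names and all numbers are made up for illustration.

```python
import math
import random


def centered_moving_average(values, window=12):
    """Estimate the trend with a centered moving average over an even window."""
    half = window // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        # Half weight on the two end points keeps the even-length window centered on i.
        total = (
            0.5 * values[i - half]
            + sum(values[i - half + 1 : i + half])
            + 0.5 * values[i + half]
        )
        trend[i] = total / window
    return trend


def seasonal_means(values, trend, period=12):
    """Average the detrended values for each position in the cycle."""
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    return [sum(b) / len(b) for b in buckets]


# Synthetic series: linear trend + annual cycle + small Gaussian noise.
random.seed(0)
series = [
    100 + 0.5 * i + 5 * math.sin(2 * math.pi * i / 12) + random.gauss(0, 0.2)
    for i in range(48)
]

trend = centered_moving_average(series)
seasonal = seasonal_means(series, trend)

# Whatever is left after removing trend and seasonal pattern is treated as noise.
residual = [
    None if t is None else v - t - seasonal[i % 12]
    for i, (v, t) in enumerate(zip(series, trend))
]
```

Each step is a small manipulation on its own, and plotting `series`, `trend`, `seasonal` and `residual` side by side is where the combination with visualization pays off.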

What I love about the chapter is that the focus is entirely on getting as accurate an understanding of the variation over time as possible (motivated by a misleading plot), but the process of doing so concretely demonstrates the power of combining small data manipulation steps with visualization.

As some examples of the range of tasks involved (from simple to complex):

Cairo works through Spanish data on Social Security enrollment and the Spanish population aged 16 to 64. In class, I've used the closest US equivalents I could find:

cwickham commented 5 years ago

I've also found replicating graphics from data in the fivethirtyeight R package to be a fun way to motivate practice with data manipulation and visualization.

For example, I've used this case study after covering ggplot2, dplyr and tidyr.

lwjohnst86 commented 5 years ago

As per #126 and our last meeting, closing this for now.