Develop main 'data project'

grattan / R_at_Grattan

Using R at Grattan

https://grattan.github.io/R_at_Grattan/

Creative Commons Zero v1.0 Universal


Develop main 'data project' #6

Closed wfmackey closed 2 years ago

wfmackey commented 5 years ago

Develop a main 'data project' that the documentation will follow.

The data needs to allow:

read messy Excel files: readxl::read_excel
read CSV files: readr::read_csv
join data from different sources; esp from readabs
visualise via bar, point, maps (absmapsdata)
generate summary statistics by group
use weights
run some regression analysis
use gather and spread
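
A minimal sketch of three of the requirements above (summary statistics by group, weights, and gather/spread), assuming dplyr and tidyr are installed. The `toy` data frame and its column names are hypothetical and only for illustration; they are not from any of the datasets discussed in this thread.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two regions, two years, with survey weights
toy <- tibble(
  region = c("A", "A", "B", "B"),
  year   = c(2018, 2019, 2018, 2019),
  income = c(50, 60, 40, 45),
  weight = c(1, 3, 2, 2)
)

# Weighted mean income by region (group-wise summary using weights)
by_region <- toy %>%
  group_by(region) %>%
  summarise(mean_income = weighted.mean(income, weight))

# Long to wide: one column per year
wide <- toy %>%
  select(region, year, income) %>%
  spread(key = year, value = income)

# Wide back to long: one row per region-year
long <- wide %>%
  gather(key = "year", value = "income", -region)
```

`gather()` and `spread()` still work but are superseded in current tidyr; `pivot_longer()` and `pivot_wider()` are the newer equivalents.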

I also think it should be publicly accessible (so no ABS microdata) and reasonably large (large enough to make it 'worth it').

MattCowgill commented 5 years ago

Table 16 here has labour force stats (unemp rate, etc. etc.) by SA4 for each month of the past 21 years; that could be useful: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/6291.0.55.001Jun%202019?OpenDocument

Works with readabs too. I think it might be 2011 ASGS but not sure.

wfmackey commented 5 years ago

ABS puts out a bunch of separate Excel files by SA2-SA4 in their 'Data by Region' publication: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1410.02013-18?OpenDocument

It looks gross, but pretty detailed. Runs from 2011-2018.

wfmackey commented 5 years ago

> Works with readabs too. I think it might be 2011 ASGS but not sure.

Coool. Maybe the project could be building profiles of SA4s in Australia over time: pop, income, LFS, etc.

MattCowgill commented 5 years ago

Yep, although I think it would be best if we had some concise research question to answer, rather than just building profiles. Something like "did areas with high unemployment swing against the government" or whatever (not that, but something like that).

wfmackey commented 5 years ago

Yes! That's important.

So let's use ABS data on an SA4 level, that can be joined with LFS via readabs, and joined with polling booth data.

We could explore how areas with high unemployment (or that are in Sahm-rule recessions) vote.

wfmackey commented 5 years ago

This structure also allows a sub-subsection on how best to download data from TableBuilder.

jonathananolan commented 5 years ago

I think we should come up with a project that has layers which allow the introduction of more complex analysis, but with a great payoff at the end of each relatively short section. Cleaning a non-tidy dataset should not be first because it's boring and conceptually difficult.

VISTA is a great tidy dataset - but we could use another one as long as it has geographical and temporal elements.

So the start could be:

S1: What's the mode share in Melbourne?

Importing a CSV.
Pipes
group_by and summarise
write_csv

S2: Is the mode share different for women and men?

*_join
Simple bar chart in ggplot2.
Getting your project ready for QC

S3: Is the mode share changing over time?

lubridate.

S4: Is there a relationship between unemployment and mode share?

Cleaning messy ABS data frames
Creating maps

And on from there...
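
As a rough sketch of what the first section could cover (importing a CSV, pipes, group_by and summarise, write_csv), assuming readr and dplyr are installed. The file and its `mode` and `trips` columns are hypothetical stand-ins written to a temporary path; they are not VISTA's actual structure.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for a trips file; the column names are assumptions
trips_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    mode  = c("car", "car", "train", "walk"),
    trips = c(3, 1, 2, 2)
  ),
  trips_path
)

# Import the CSV, total the trips by mode, then convert totals to shares
mode_share <- read_csv(trips_path, show_col_types = FALSE) %>%
  group_by(mode) %>%
  summarise(trips = sum(trips)) %>%
  mutate(share = trips / sum(trips))

# Save the summary table for later sections
write_csv(mode_share, tempfile(fileext = ".csv"))
```

With this toy input, car accounts for half of all trips and train and walk for a quarter each, which gives a quick payoff before any data cleaning is introduced.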

MattCowgill commented 5 years ago

I like this idea, though I'm not sure how far down the path of using a different data set/ project @wfmackey is.

One thing I don't like about your structure @jonathananolan is "getting your project ready for QC" sitting where it is. We really want to stress the importance of project setup, folder structure, etc. so this will come before the data work.