public-salaries / public_salaries

Public sector employee salaries

Illinois Teachers' Salaries from 1999--2012 #3

soodoku opened this issue 7 years ago

soodoku commented 7 years ago

Source:

http://www.familytaxpayers.org/ftf/ftf_salaries.php

For each year, the site lists all districts. Each school in a district links to a clickable list of teachers and salaries. Each teacher's name is clickable and leads to metadata about the teacher.

Useful to produce year-by-year lists for now. We can merge later.
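A rough sketch of what pulling those clickable links might look like in R with rvest (the URL is the listing page above; the CSS selector and the link filter are guesses, not verified against the site):

```r
library(rvest)

# Starting page that lists the year/district links (assumed structure).
base_url <- "http://www.familytaxpayers.org/ftf/ftf_salaries.php"

# Pull every anchor href off the listing page; the real page may need a
# more specific selector than "a".
listing <- read_html(base_url)
links   <- html_attr(html_nodes(listing, "a"), "href")

# Keep only links that look like district/salary pages (pattern is a guess).
district_links <- links[grepl("salar", links, ignore.case = TRUE)]
head(district_links)
```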

ChrisMuir commented 7 years ago

@soodoku I got started on this. I've written a script that gets all of the district links for each year and then, for each set of district links, goes through each one and collects all of the teacher links. I've gotten all of the district links, but only the teacher links for the first year's worth of districts (2012), because the sheer number of requests to the website grows very quickly. There are 14 years, 14,816 district links in total, and 162,471 teacher links for 2012 alone. Pulling the teacher links for 2012 (14,816 requests) took about half an hour, so grabbing the data from every teacher link is going to take 70-80 hours of constantly pinging the website: ((162K × 14) / 15K) × 0.5h ≈ 75h.
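Spelling that estimate out as a quick back-of-the-envelope calculation in R (same numbers as above):

```r
teacher_links_2012 <- 162471  # teacher links found for 2012
n_years            <- 14      # 1999-2012
links_per_half_hr  <- 14816   # the ~0.5h pass above covered this many requests

total_requests <- teacher_links_2012 * n_years             # ~2.27M, if 2012 is typical
est_hours      <- (total_requests / links_per_half_hr) * 0.5
est_hours                                                   # roughly 75-80 hours
```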

I've written code to scrape all the teacher links, iterate over them, and grab the metadata for each link, but I don't know if I feel right about hitting their server as hard as it would take to get all that data.

For now, I'll push the script file to the repo.

soodoku commented 7 years ago

I see. Worrying about putting load on their server is reasonable. Two aspects to that:

  1. The webpages are simple enough that I don't see a huge impact on bandwidth per request. Plus, unlimited bandwidth plans with hosting companies are common.
  2. The big concern is too many requests at the same time. We aren't doing that.

But it still makes sense to a) go year by year, and b) call Sys.sleep(1) between requests.
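Something like this pacing pattern is what I have in mind (a sketch only; the parsing is a placeholder, not the actual scraper in the repo):

```r
library(xml2)

# Hypothetical helper: fetch one teacher page and return a one-row data frame.
# The field parsing is left as a placeholder; this only shows the pacing.
scrape_teacher <- function(url) {
  page <- read_html(url)
  # ... parse the salary / metadata fields off `page` here ...
  data.frame(url = url, stringsAsFactors = FALSE)
}

# One year at a time, sleeping a second between requests.
scrape_year <- function(teacher_urls) {
  out <- vector("list", length(teacher_urls))
  for (i in seq_along(teacher_urls)) {
    out[[i]] <- scrape_teacher(teacher_urls[i])
    Sys.sleep(1)  # be polite: at most one request per second
  }
  do.call(rbind, out)
}
```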

We can also email them to ask for the data, though I am not very optimistic that we will get anything.

What do you think?

ChrisMuir commented 7 years ago

Yeah, I have calls to Sys.sleep between each request. The robots.txt file doesn't mention /ftf/, so I guess it's fine as long as we go easy on them. It'll take some time to complete; I can probably set up a single year to run overnight, and just do that until we have each of the years done.
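For what it's worth, one way to sanity-check robots.txt programmatically is the robotstxt package; a sketch (not necessarily what the repo script does):

```r
library(robotstxt)

# Ask whether generic crawlers are allowed to fetch the /ftf/ pages.
paths_allowed(
  paths  = "/ftf/ftf_salaries.php",
  domain = "www.familytaxpayers.org",
  bot    = "*"
)
```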

I'm going out of town Wednesday morning thru Monday, so I probably won't be able to start that process until next week.

If you or anyone else on the team starts it and has questions about the code, feel free to let me know.

ChrisMuir commented 7 years ago

So I worked on this some more today; quick update: I let the script run overnight to scrape all of the teacher links from each of the district links. There's a total of 2,226,915 unique teacher links (keep in mind, one teacher can have multiple links, as the data is split up by year). Assuming 0.5 seconds of computation/rvest time per request, plus a Sys.sleep of 2 seconds per request, we're looking at about 1,546 hours (2,226,915 × 2.5 s), or 64.4 days, of non-stop scrape time to complete the task.

Should we consider narrowing the focus here? Maybe limit the years to the five most recent (2008-2012), or filter the 2M+ links to only include unique teachers (keeping only the most recent instance of each teacher)?
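If we go the dedup route, a sketch of keeping only the most recent instance of each teacher could look like this (the column names are assumptions, not the actual headers):

```r
library(dplyr)

# `teacher_links` assumed: one row per teacher-year, with hypothetical columns
# teacher_id (or name + district), year, and url.
latest_only <- teacher_links %>%
  arrange(desc(year)) %>%
  distinct(teacher_id, .keep_all = TRUE)
```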

Let me know what you think. I made a few general refactoring edits to the scrape code today; I'll push those to the repo now.

soodoku commented 7 years ago

Awesome @ChrisMuir!

2.2M teacher-years is a lot! I agree that we should start out small. Probably do 2012 first and then go back in time slowly. One year at a time makes sense to me. And we can spread the work out over the coming weeks.

p.s. There are some odd things in the data including $0 salaries.

ChrisMuir commented 7 years ago

Cool, yeah I'm letting it run on the 2012 teacher links for now. Once that's done, I'll write those results to csv and upload to the repo. We can take a look at that data and decide what to do from there. Thanks!

ChrisMuir commented 6 years ago

Just pushed the 2012 IL teacher salaries to the repo. The data came out of the website very clean: over 162K records were scraped, and every single one returned as a neat 10-variable data frame, all with the same column headers. That made binding them all into a single data set headache-free.
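For reference, the binding step is simple when every record shares the same columns; roughly this (a sketch, not necessarily the exact code in the repo):

```r
library(dplyr)

# `records` assumed to be a list of one-row data frames, one per teacher,
# all sharing the same 10 column names.
salaries_2012 <- bind_rows(records)
write.csv(salaries_2012, "il_teacher_salaries_2012.csv", row.names = FALSE)
```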

I'll start the script on 2011 tonight.

soodoku commented 6 years ago

I think this is done also, right? Should we close this issue @ChrisMuir?

ChrisMuir commented 6 years ago

No, unfortunately this isn't done; I'm slowly working through each year. The number of records per year is around 160K, and each record requires a single request to the website via xml2, so even with a small amount of Sys.sleep between each request, the scraping is very slow. Each year is taking about a week to complete; I've gotten 2010-2012 done, and 2009 is scraping right now. There are records to get for every year, 1999-2012.

If you don't think we need to go all the way back to 1999, that's no problem, just let me know.

soodoku commented 6 years ago

Righto! Thanks, man!

I vote for getting all the data. Longitudinal data is great for econometrics. Paired with some outcome data (from health to economic outcomes), it can probably lead to important insights.

soodoku commented 6 years ago

Even descriptively, it would be great to know how teacher salaries have fared under Republicans vs. Democrats, how close elections affect salaries, and also just how they compare over time to the median wage in the respective areas.

ChrisMuir commented 6 years ago

Cool, got it. I'll keep adding data to the repo as each year finishes.

ChrisMuir commented 6 years ago

Quick update: the site has been completely down for the last ~48 hours. No "site maintenance" screen or anything, just a blank white page. I'll keep checking it.
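A quick way to keep checking it programmatically (a sketch with httr, just the idea, not what I'm actually running):

```r
library(httr)

resp <- try(GET("http://www.familytaxpayers.org/ftf/ftf_salaries.php", timeout(10)),
            silent = TRUE)
if (inherits(resp, "try-error")) {
  message("no response (connection error / timeout)")
} else {
  message("HTTP status: ", status_code(resp))
}
```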

soodoku commented 6 years ago

sigh. just checked. still down.