soodoku opened this issue 7 years ago
@soodoku I got started on this. I've written a script that gets all of the district links for each year, and then, for each set of district links, goes through each one and gets all of the teacher links. I've gotten all of the district links, but have only gotten the teacher links for the first year's worth of districts (2012), because the sheer number of requests being made to the website grows very quickly. The number of years is 14, the number of district links is 14816, and the number of teacher links for 2012 alone is 162471. Pulling the teacher links for 2012 (14816 total requests) takes about half an hour, so grabbing all of the data from each teacher link is going to take 70-80 hours of constantly pinging the website: ((162K × 14) / 15K) × 0.5h ≈ 75h.
I've written code to scrape all of the teacher links, iterate over them, and grab the metadata for each link, but I don't know if I feel right about blasting their server to the level it would take to get all that data.
For now, I'll push the script file to the repo.
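For what it's worth, here's a minimal sketch of the two-level link harvest described above. The CSS selectors (`a.district`, `a.teacher`) are placeholders, since the actual page markup isn't shown in this thread:

```r
library(rvest)

# Placeholder selector; the real markup on ftf_salaries.php may differ.
district_links <- read_html("http://www.familytaxpayers.org/ftf/ftf_salaries.php") %>%
  html_nodes("a.district") %>%
  html_attr("href")

# For each district page, pull the teacher links (selector also assumed).
get_teacher_links <- function(district_url) {
  read_html(district_url) %>%
    html_nodes("a.teacher") %>%
    html_attr("href")
}

teacher_links <- unlist(lapply(district_links, get_teacher_links))
```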
I see. Worrying about putting load on their server is reasonable. Two things still make sense: a) go year by year, and b) do Sys.sleep(1) between requests.
We can also email them to ask for the data. I am not very optimistic that we will get anything.
What do you think?
Yeah, I have calls to Sys.sleep() between each request. The robots.txt file doesn't mention /ftf/, so I guess it's fine, as long as we go easy on them. It'll take some time to complete; I can probably set up a single year to run overnight, and just do that until we have each of the years done.
I'm going out of town Wednesday morning through Monday, so I probably won't be able to start that process until next week.
If you or anyone else on the team starts it and has questions about the code, feel free to let me know.
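A rough sketch of the throttled, one-year-at-a-time loop being discussed here; `parse_teacher_page()` stands in for whatever extraction logic the script actually uses:

```r
library(rvest)

scrape_year <- function(teacher_links, sleep_secs = 1) {
  results <- vector("list", length(teacher_links))
  for (i in seq_along(teacher_links)) {
    page <- read_html(teacher_links[i])
    results[[i]] <- parse_teacher_page(page)  # placeholder extraction function
    Sys.sleep(sleep_secs)  # pause between requests to go easy on the server
  }
  do.call(rbind, results)
}
```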
So I worked on this some today; quick update. I let the script run overnight to scrape all of the teacher links from each of the district links. There's a total of 2,226,915 unique teacher links (keep in mind, one teacher can have multiple links, as the data is split up by year). Assuming 0.5 seconds of computation/rvest time per request, plus a Sys.sleep() of 2 seconds per request, we're looking at ~1,546 hours, or 64.4 days, of non-stop scrape time to complete the task.
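For the record, the arithmetic behind that estimate:

```r
n_links <- 2226915          # unique teacher links
per_req <- 0.5 + 2          # ~0.5s of rvest/computation + Sys.sleep(2)
n_links * per_req / 3600    # ~1546 hours
n_links * per_req / 86400   # ~64.4 days
```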
Should we consider narrowing our focus on this? Maybe limit it to the five most recent years (2008-2012), or filter the 2M+ links down to unique teachers (keeping only the most recent instance of each teacher)?
Let me know what you think. I made a few general refactoring edits to the scrape code today; I'll push that to the repo now.
Awesome @ChrisMuir!
2.2M teacher-years is a lot! I agree that we should start out small. Probably do 2012 first and then go back in time slowly. One year at a time makes sense to me. And we can do it over the next many days.
p.s. There are some odd things in the data, including $0 salaries.
Cool, yeah, I'm letting it run on the 2012 teacher links for now. Once that's done, I'll write those results to CSV and upload them to the repo. We can take a look at that data and decide what to do from there. Thanks!
Just pushed the 2012 IL teacher salaries to the repo. The data came out very clean from the website: over 162K records were scraped, and every single one returned as a neat 10-variable data frame, all with the same column headers. That made binding them all up into a single data set headache-free.
I'll start the script on 2011 tonight.
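The binding step mentioned above is essentially this (assuming `results` is the list of identically-structured data frames, one per teacher record; the file name is illustrative):

```r
# Every element of `results` has the same 10 columns, so rbind is safe.
salaries_2012 <- do.call(rbind, results)
write.csv(salaries_2012, "il_teacher_salaries_2012.csv", row.names = FALSE)
```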
I think this is done also, right? Should we close this issue, @ChrisMuir?
No, unfortunately this isn't done; I'm slowly working through each year. The number of records per year is around 160K, and each record requires a single request to the website via xml2, so with a small amount of Sys.sleep() between each request, the scraping is very slow. Each year is taking about a week to complete. I've gotten 2010-2012 done, and 2009 is scraping right now. I have records for every year, 1999-2012.
If you don't think we need to go all the way back to 1999, that's no problem; just let me know.
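Since each year takes about a week, one way to make the runs resumable is to append each record to disk as it completes and skip URLs already scraped. A sketch, with `scrape_teacher()` as a placeholder for the xml2-based extraction:

```r
scrape_with_checkpoint <- function(links, out_file, sleep_secs = 2) {
  # URLs already written to disk by a previous (interrupted) run
  done <- if (file.exists(out_file)) read.csv(out_file)$url else character(0)
  for (url in setdiff(links, done)) {
    row <- tryCatch(scrape_teacher(url),  # placeholder extraction function
                    error = function(e) NULL)
    if (!is.null(row)) {
      row$url <- url
      # Append each record as it completes, so a crash loses at most one row
      write.table(row, out_file, sep = ",",
                  append = file.exists(out_file),
                  col.names = !file.exists(out_file), row.names = FALSE)
    }
    Sys.sleep(sleep_secs)
  }
}
```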
Righto! Thanks, man!
I vote for getting all the data. Longitudinal data is great for econometrics. Paired with some outcome data (from health to economic outcomes), it can probably lead to important insights.
Even descriptively, it would be great to know how teacher salaries have fared under Republicans and Democrats, how close elections affect salaries, and also just how they compare over time to the median wage in the respective areas.
Cool, got it. I'll keep adding data to the repo as each year finishes.
Quick update: the site has been completely down for the last ~48 hours. No "site maintenance" screen or anything, just a blank white page. I'll keep checking it.
sigh. just checked. still down.
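A quick way to poll whether the site is back, assuming httr is fine to use here (note that a blank page can still return 200, so this is only a first-pass check):

```r
library(httr)

site_up <- function(url = "http://www.familytaxpayers.org/ftf/ftf_salaries.php") {
  resp <- tryCatch(GET(url, timeout(10)), error = function(e) NULL)
  !is.null(resp) && status_code(resp) == 200
}

site_up()  # FALSE while the site is unreachable
```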
Source: http://www.familytaxpayers.org/ftf/ftf_salaries.php
For each year, the site lists all districts. Each school in a district brings you to a clickable list of teachers and salaries. Each teacher's name is clickable and gives metadata on the teacher.
Useful to produce year-by-year lists for now. We can merge later.