Open soodoku opened 7 years ago
FYI, the service at the link in your post is unavailable, see screenshot. No idea if it's just temporarily down, but figured I'd document it.
Temporary outage. I can see it.
I manually pulled the latest PDFs (dated 2017-11-15) and wrote a script that reads them all in, extracts the data, merges it into a single data frame, and writes it to CSV. As a next step, I will write a scraper to automate pulling the PDFs for all time frames, then try to apply the aggregation script to all of them.
I pushed the 2017-11-15 data (raw PDFs and a 7z of the output data frame) and the aggregation script to the repo.
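For anyone following along, the merge-and-write part of the aggregation step described above might look roughly like the sketch below. This is not the script that was pushed to the repo. It assumes the PDF extraction has already produced one list of row dictionaries per PDF, and the names `merge_tables` and `write_csv` are hypothetical.

```python
import csv
from typing import Dict, Iterable, List, TextIO

Row = Dict[str, str]


def merge_tables(tables: Iterable[List[Row]]) -> List[Row]:
    """Concatenate per-PDF tables (lists of row dicts) into one list of rows."""
    merged: List[Row] = []
    for table in tables:
        merged.extend(table)
    return merged


def write_csv(rows: List[Row], out: TextIO) -> None:
    """Write rows to CSV, using the union of all keys as the header.

    Columns missing from a given row are written as empty strings.
    """
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Taking the union of keys for the header is a guess at what is needed: it keeps the merge from failing if some PDFs carry a column that others lack.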
Nice man! Really cool! :-)
Hey @ChrisMuir, should we close this issue?
Ah, I haven't yet completed the next step (writing a script to pull all of the PDFs from the website). We have a plan in place, but I don't know whether that's enough to close this issue or whether you want to wait until all of the PA work has been completed. It's up to you.
Thanks, man! Let's wait.
Source: http://pennwatch.pa.gov/employees/Pages/Employee-Salaries.aspx
To scrape, use the form to subset salaries by range ($0 to $25,000, etc.).
Add columns indicating the year and month to each scraped row. Also add the department to each row; it comes up as a title.
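A minimal sketch of the two bookkeeping steps above: generating salary ranges to feed the form, and tagging each scraped row with year, month, and department. Only the first range ($0 to $25,000) comes from this issue. The step size, the top cutoff, and the function names `salary_bands` and `tag_rows` are assumptions for illustration, and the actual form submission is not shown.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]


def salary_bands(step: int = 25_000, top: int = 100_000) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (low, high) salary ranges; the final range is open-ended (high is None).

    The step and top values are assumed, not taken from the site.
    """
    low = 0
    while low < top:
        yield (low, low + step)
        low += step
    yield (top, None)


def tag_rows(rows: List[Row], year: int, month: int, department: str) -> List[Row]:
    """Return copies of the rows with year, month, and department columns added."""
    return [
        {**row, "year": year, "month": month, "department": department}
        for row in rows
    ]


# Example: tag rows scraped for one department and one reporting period.
scraped = [{"name": "Jane Doe", "salary": 24_000}]  # made-up row for illustration
tagged = tag_rows(scraped, year=2017, month=11, department="Example Department")
```

Tagging at scrape time keeps the period and department attached to every row, so the rows from different ranges and departments can be concatenated later without losing where they came from.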