public-salaries / public_salaries

Public sector employee salaries

Idaho 2013 pdf to CSV #23

Open soodoku opened 6 years ago

soodoku commented 6 years ago

https://github.com/public-salaries/public_salaries/tree/master/id/2013

ChrisMuir commented 6 years ago

Have a little free time, so I'm working on this now. What's the URL source of this PDF? I don't see it listed in the ID README, and I just want to include it in the comments at the top of the script file.

soodoku commented 6 years ago

2013 is from: https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf

But as you can see on the title page, the data are from Transparent Idaho; I will get a link from that site. There are more PDFs like this on the Transparent Idaho website, including one for 2018: https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf

soodoku commented 6 years ago

2014 here: http://mediad.publicbroadcasting.net/p/kisu/files/workforce.pdf

ChrisMuir commented 6 years ago

Cooool, thanks!

ChrisMuir commented 6 years ago

Just finished extracting data from the 2013, 2014, and 2018 PDFs, and pushed the 7z files and script files to the repo.

This ended up being a huge pain. For some reason pdftools was working just fine for the 2013 PDF, but it stopped working about a week ago, and from that point on it wouldn't work for any of the ID PDF files. By "wouldn't work", I mean pdf_text would read the correct number of pages in the doc but would return an empty string for each page. I ended up writing a custom function that mimics pdftools::pdf_text by calling

system2("pdftotext", args = c("-table", path_to_pdf_file))

which is pretty hacky. I'm working on a PC; I'm not sure whether that will work on any other OS.
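
A minimal sketch of that approach, for reference (not the exact code in the repo; it assumes the Xpdf build of pdftotext, the one that supports -table, is on the PATH, and the function name is just illustrative):

# Rough stand-in for pdftools::pdf_text() that shells out to pdftotext.
# Assumes the Xpdf pdftotext binary (the build with the -table flag) is on the PATH.
pdf_text_via_pdftotext <- function(path_to_pdf_file) {
  out_file <- tempfile(fileext = ".txt")
  # -table tries to preserve the column layout of the salary tables
  system2("pdftotext", args = c("-table", shQuote(path_to_pdf_file), shQuote(out_file)))
  txt <- paste(readLines(out_file, warn = FALSE), collapse = "\n")
  # pdftotext separates pages with a form feed, so splitting on "\f" gives
  # one string per page, roughly matching the shape of pdf_text() output
  strsplit(txt, "\f", fixed = TRUE)[[1]]
}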

Also, as of now the three ID script files for the individual PDFs are effectively identical. At some point I will replace them with a single script that reads from and writes to each yearly folder.
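
Roughly, that consolidated script would just loop over the year folders, along these lines (a sketch only; the folder layout and the extract_id_salaries() helper are placeholders, not the repo's actual structure):

# Hypothetical driver script: one pass over the yearly ID folders.
# extract_id_salaries() is a stand-in for the shared parsing logic, not a real
# function in the repo, and the folder names are illustrative.
extract_id_salaries <- function(pdf_file) {
  # placeholder body: the real logic would parse the pdftotext output into columns
  data.frame(source_file = pdf_file, stringsAsFactors = FALSE)
}

for (yr in c("2013", "2014", "2018")) {
  pdf_file <- list.files(file.path("id", yr), pattern = "\\.pdf$", full.names = TRUE)[1]
  out <- extract_id_salaries(pdf_file)
  write.csv(out, file.path("id", yr, paste0("id_salaries_", yr, ".csv")), row.names = FALSE)
}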

soodoku commented 6 years ago

oy! sorry to hear.

pdftools:

  1. Don't know what's going on with pdftools, but post Windows update, some things may need admin privileges to run correctly, as the function may be calling something else in the backend. Always worth a try to run it as admin.

  2. I did notice that my MiKTeX conked out a week ago also, so I had to reinstall it and set up the path, etc., again.

  3. The other alternative to pdftools is ABBYY FineReader. It isn't free, but they have an API and there is an R wrapper (rough sketch below). ABBYY is generally considered best in class for commercial OCR.
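
For reference, the basic flow with the abbyyR wrapper is roughly the sketch below (from memory, so double-check the function and argument names against the package documentation; it needs paid ABBYY Cloud OCR SDK credentials, and the file path is just a placeholder):

# Sketch of OCR through the ABBYY Cloud OCR SDK via the abbyyR wrapper.
# Function and argument names are from memory; verify against the abbyyR docs.
library(abbyyR)

setapp(c("my_app_id", "my_app_password"))           # register credentials for the session
getAppInfo()                                        # sanity check: app status, remaining credits
processImage(file_path = "path/to/Idaho 2013.pdf")  # submit a file for processing
getResults()                                        # download output of finished tasks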

No worries on the 3 scripts. And congrats on getting across the line on this one! Seems v. painful, and that is where some new software is born! :-)

ChrisMuir commented 6 years ago

Yeah, it's all good. What's weirdest is that I was initially working with the 2013 doc on a Mac when the issue started happening about a week ago. I then tested it on my work PC and it was doing the same thing (and it's persisting for all of the Idaho PDF docs), so the pdftools issue is cutting across Mac and PC for me.

Actually, do you mind trying it yourself? Try running:

url <- "https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"
txt <- pdftools::pdf_text(url)

and let me know if it works for you. For reference, on my end it reads a single empty string for each page, so this resolves to TRUE:

identical(
  pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"), 
  rep("", 1012)
)
#> [1] TRUE

Just let me know what results you get if you don't mind.

soodoku commented 6 years ago

Dear @ChrisMuir,

Reason for the delay: the URL is now dead. Tried on both Linux and Windows; same result, a bunch of empty strings.

ChrisMuir commented 6 years ago

No worries on the delay; thanks for trying and for the heads up!