Open soodoku opened 6 years ago
Have a little free time, am working on this now....what's the url source of this PDF? I don't see it listed in the ID README, and just want to include it in the comments at the top of the script file.
2013 is from: https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf
But as you can see on the title page, the data are from Transparent Idaho. will get a link from that site. there are more PDFs like this on the Transparent Idaho website, including one for 2018: https://ibis.sco.idaho.gov/pubtrans/workforce/Workforce%20by%20Name%20Summary-en-us.pdf
Cooool, thanks!
Just finished extracting data from the 2013, 2014, and 2018 PDFs, and pushed the 7z files and script files to the repo.
This ended up being a huge pain. For some reason pdftools was working just fine for the 2013 PDF but then just stopped working about a week ago, and from that point on it wouldn't work for any of the ID PDF files. By wouldn't work, I mean pdf_text would read the correct number of pages in the doc, but would return an empty string for each page. I ended up writing a custom function which mimics pdftools::pdf_text and calls
system2("pdftotext", args = c("-table", path_to_pdf_file))
which is pretty hacky. I'm working on a PC, and I'm not sure if that will work on any other OS.
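For anyone landing here later, a rough sketch of what a wrapper like that could look like. This is not the actual function from the repo: the names split_pages and pdf_text_cli are made up for illustration, and it assumes a pdftotext binary is on the PATH. As far as I know the -table flag belongs to the Xpdf build of pdftotext, so a Poppler build may need -layout instead.

```r
# Hypothetical helper: split raw pdftotext output into one string per page.
# pdftotext terminates each page with a form feed ("\f").
split_pages <- function(raw) {
  strsplit(raw, "\f", fixed = TRUE)[[1]]
}

# Hypothetical stand-in for pdftools::pdf_text() that shells out to pdftotext.
# `mode` defaults to "-table" as in the call above; that flag is assumed to be
# supported by the installed build (swap in "-layout" if it is not).
pdf_text_cli <- function(path, mode = "-table") {
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext was not found on the PATH")
  }
  out <- tempfile(fileext = ".txt")
  on.exit(unlink(out), add = TRUE)

  # Quote both paths so spaces in file names survive the shell.
  status <- system2("pdftotext", args = c(mode, shQuote(path), shQuote(out)))
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }

  raw <- paste(readLines(out, warn = FALSE), collapse = "\n")
  split_pages(raw)
}
```

Usage would be something like pages <- pdf_text_cli("Idaho 2013.pdf"), giving one string per page like pdf_text does. Since it only relies on system2 and a binary on the PATH, it should in principle run on Mac and Linux too, but I have not verified that.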
Also, as of now the three ID script files for each individual PDF are effectively identical; at some point I will replace them with a single script that reads from and writes to each individual yearly folder.
oy! sorry to hear.
pdftools: dk on the situation with pdftools, but post windows update some stuff may need admin privs to run correctly, as the function may be calling something else in the backend. always worth a try to run as admin.
i did notice that my MiKTeX conked out a week ago also. so i had to reinstall that and set up the path etc. again.
the other alternative to pdftools = ABBYY FineReader. it isn't free but there is an API and an R wrapper for it. ABBYY is generally considered best in class for commercial OCR.
no worries on the 3 scripts. and congrats on getting across the line on this one! seems v. painful and that is where some new software is born! :-)
Yeah, it's all good. What's weirdest is that I was initially working with the 2013 doc on a Mac, then the issue started happening about a week ago, tested it on my work PC and it was doing the same thing (and is persisting for all of the Idaho pdf docs).....so the pdftools issue is cutting across Mac and PC for me.
Actually, do you mind trying it yourself? Try running:
url <- "https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"
txt <- pdftools::pdf_text(url)
and let me know if it works for you. For reference, it reads a single empty string for each page for me....so this resolves to TRUE for me:
identical(
pdftools::pdf_text("https://pibuzz.com/wp-content/uploads/post%20documents/Idaho%202013.pdf"),
rep("", 1012)
)
#> TRUE
Just let me know what results you get if you don't mind.
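The identical() check above hard-codes the 1012 pages of the 2013 doc. A slightly more general version, just as a sketch (the name pages_all_empty is made up here, it is not part of pdftools), works on whatever character vector pdf_text returns:

```r
# Hypothetical helper: TRUE when every page came back empty or whitespace-only.
# `pages` is a character vector with one element per page, as returned by
# pdftools::pdf_text(). A zero-length vector counts as "no pages", not "empty".
pages_all_empty <- function(pages) {
  length(pages) > 0 && all(!nzchar(trimws(pages)))
}
```

So pages_all_empty(txt) could be run right after the pdf_text call to flag the failure, regardless of how many pages the PDF has.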
dear @ChrisMuir,
reason for delay = URL is now dead. tried on both linux and windows --- same result --- bunch of empty strings.
No worries on delay, thanks for trying and for the heads up!
https://github.com/public-salaries/public_salaries/tree/master/id/2013