PDF cracking - Githubissues

sketch-city / project-ideas

Running list of all project ideas - pick one and run with it!

http://sketch-city.github.io/project-ideas/


PDF cracking #48

Open fileunderjeff opened 8 years ago

fileunderjeff commented 8 years ago

I'd like to set up a project that:

1- identifies high-value datasets currently in PDF form
2- converts them into a better format (CSV, JSON, shapefile, etc.)
3- publishes them on the Houston open data portal
4- publishes a methodology for repeating this process

There's so much valuable stuff in PDF form. The fun part would be rounding it all up and figuring out what's important.
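Step 2 in the list above (converting to a better format) could be sketched roughly as follows. This is a minimal illustration that assumes the table rows have already been pulled out of a PDF by some extraction tool; the function names `rows_to_csv` and `rows_to_json` and the sample permit data are hypothetical and not part of the original proposal.

```python
import csv
import io
import json


def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(header, rows):
    """Serialize the same rows as a JSON array of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2)


# Hypothetical sample data standing in for a table extracted from a PDF.
header = ["permit_id", "address", "issued"]
rows = [
    ["P-001", "100 Main St, Houston", "2016-01-04"],
    ["P-002", "200 Elm St, Houston", "2016-01-05"],
]

csv_text = rows_to_csv(header, rows)
json_text = rows_to_json(header, rows)
```

Keeping the extracted rows in one plain structure and serializing from it means the CSV and JSON outputs cannot drift apart, which may help when documenting a repeatable methodology.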

juyeongkim commented 8 years ago

Could Tabula be used for this project? There is even an R binding to the Tabula Java library now.

randy7771026 commented 8 years ago

@manuelaristaran

OilGasDataAnalyst commented 8 years ago

I'm interested in this

ardouglass commented 8 years ago

Another tool I've seen for this (so it can be done programmatically) is https://www.npmjs.com/package/pdf2csv. Unless something has changed, it's being used in production for the Atlanta Courts website on their "find my court case" page.

fileunderjeff commented 8 years ago

Once a dataset is created, please reach out to the City's open data team (see #54) and let them know to add it to the open data portal!

GeoffreyPS commented 8 years ago

I would like to get involved in cracking these, either during the 2016 Hackathon or throughout the year. Do we know anyone who is able to identify the datasets? It seems like the first step is getting a running list of desirable PDFs and their URIs.

fileunderjeff commented 8 years ago

List of publicly-accessible PDFs on houstontx.gov: https://www.google.im/search?as_q=&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=houstontx.gov&as_occt=any&safe=images&as_filetype=pdf&as_rights=#q=*+site:houstontx.gov+filetype:pdf&as_qdr=all&filter=0

We need to grab all of these, review them to flag the ones that are useful, catalog the ones that are trash, and come up with an initial batch to crack.
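One way the triage described above might be tracked is a small catalog with a review status per PDF. The sketch below assumes the candidate URLs have already been collected by hand or from search results; the helper names, the status values ("unreviewed", "useful", "trash"), and the sample URLs are all hypothetical placeholders, not real documents.

```python
import csv
import io
from urllib.parse import urlparse


def is_city_pdf(url, domain="houstontx.gov"):
    """Return True if the URL is on the given domain and its path ends in .pdf."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_domain = host == domain or host.endswith("." + domain)
    return on_domain and parsed.path.lower().endswith(".pdf")


def build_catalog(urls):
    """Keep unique city PDF URLs, each starting in the 'unreviewed' state."""
    seen = set()
    catalog = []
    for url in urls:
        if is_city_pdf(url) and url not in seen:
            seen.add(url)
            catalog.append({"url": url, "status": "unreviewed", "notes": ""})
    return catalog


def catalog_to_csv(catalog):
    """Serialize the catalog to CSV text so it can be shared and edited."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["url", "status", "notes"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(catalog)
    return buffer.getvalue()


# Hypothetical candidate links; none of these paths are real documents.
candidates = [
    "http://www.houstontx.gov/example/report-a.pdf",
    "http://www.houstontx.gov/example/report-a.pdf",  # duplicate, dropped
    "http://www.houstontx.gov/example/index.html",    # not a PDF, dropped
    "https://example.com/other/report-b.pdf",         # other domain, dropped
    "http://www.houstontx.gov/example/REPORT-C.PDF",
]

catalog = build_catalog(candidates)
catalog[0]["status"] = "useful"  # a reviewer marks one entry after reading it
catalog_csv = catalog_to_csv(catalog)
```

Reviewers could then change each row's status to "useful" or "trash", and the "useful" rows would form the initial batch to crack.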