Open · fileunderjeff opened 8 years ago
Could Tabula be used for this project? There is even an R binding to the Tabula Java library now.
@manuelaristaran
I'm interested in this
Another tool I've seen for this (so it can be done programmatically) is https://www.npmjs.com/package/pdf2csv. Unless something has changed, it's being used in production on the Atlanta Courts website's "find my court case" page.
Once a dataset is created, please reach out to the City's open data team (see #54) and let them know to add it to the open data portal!
I would like to get involved in cracking these, either at the 2016 Hackathon or throughout the year. Do we know anyone who is able to identify the datasets? It seems like the first step is getting a running list of desirable PDFs and their URIs.
List of publicly-accessible PDFs on houstontx.gov: https://www.google.im/search?as_q=&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=&as_qdr=all&as_sitesearch=houstontx.gov&as_occt=any&safe=images&as_filetype=pdf&as_rights=#q=*+site:houstontx.gov+filetype:pdf&as_qdr=all&filter=0
Need to grab all of these and review the ones that are useful, catalog the ones that are trashed, and come up with an initial batch to crack.
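A rough sketch of how that first cataloging pass might look, assuming the search results have already been exported as a plain list of URLs. The function names, the `status` and `notes` columns, and the "unreviewed" value are all made up for illustration, not an agreed format:

```python
import csv
import io
from urllib.parse import urlsplit


def build_pdf_catalog(urls, domain="houstontx.gov"):
    """Return de-duplicated catalog rows for PDF links on the given domain."""
    seen = set()
    rows = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Keep only links on the domain (or a subdomain) whose path ends in .pdf
        on_domain = host == domain or host.endswith("." + domain)
        if not on_domain or not parts.path.lower().endswith(".pdf"):
            continue
        # Treat http/https variants of the same host and path as one document
        key = (host, parts.path)
        if key in seen:
            continue
        seen.add(key)
        # "status" and "notes" are hypothetical columns for the manual review
        rows.append({"url": url, "status": "unreviewed", "notes": ""})
    return rows


def catalog_to_csv(rows):
    """Serialize catalog rows to CSV text that reviewers can edit."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "status", "notes"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Reviewers could then open the CSV in a spreadsheet and change `status` by hand as each PDF is checked, which gives us the running list and the initial batch in one file.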
I'd like to set up a project that:
1- identifies high-value datasets currently in PDF form
2- converts them into a better format (CSV, JSON, shapefile, etc.)
3- publishes them on the Houston open data portal
4- publishes a methodology for repeating this process
There's so much valuable stuff in PDF form. The fun part would be rounding it all up and figuring out what's important.
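For step 2, here is a minimal sketch of the conversion half only. It assumes some extraction tool (Tabula, pdf2csv, or similar) has already pulled a table out of the PDF as a header plus a list of rows; the helper names and the sample column names in the usage note are hypothetical:

```python
import csv
import io
import json


def clean_rows(rows):
    """Strip whitespace from every cell and drop rows that are entirely blank."""
    cleaned = []
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def rows_to_csv(header, rows):
    """Write the header and rows out as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def rows_to_json(header, rows):
    """Write the rows out as a JSON array of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2)
```

Something like `rows_to_csv(["department", "budget"], clean_rows(raw_rows))` would give a file ready for the portal. Values stay as strings here; deciding on types and units per dataset probably belongs in the methodology from step 4.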