Closed by rufuspollock 8 years ago
This was covered here @tlevine, no? http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html
@danfowler @tlevine's article was a great specific walkthrough - like a tutorial. This was more of a "complete" review of the tools out there so I think a bit different. We should finish the pad and then post ... :-)
Indeed mine references only like half of the things you mention.
ScraperWiki have a proprietary PDF reader that they say is quite good.
Good point – I’ve added that to the proprietary list.
That's a 404 for me.
Plus:
@tfmorris it won't go live until tomorrow as per the date ;-) Check the commit if you want to review in advance ...
Thanks. Online now. Had to give it a nudge to rebuild.
Write a post reviewing data wrangling tools for PDFs
Text in progress at:
http://pad.okfn.org/p/labs-post-pdf-tools (inlined below)
Based on research material in https://github.com/okfn/ideas/issues/52
Questions:
Libraries for Extracting Data and Text from PDFs: A Review
Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.
3 categories:
The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here. [should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]
[[TODO: some nice PDF screenshots - perhaps we can reference]]
Generic (PDF -> text)
Tables from PDF
Existing open services
Existing proprietary free or paid-for services
Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview
By Language
@maxogden has this list of Node libraries and tools:
https://gist.github.com/maxogden/5842859
Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247
Other good intros