okfn / okfn.github.com

Open Knowledge Labs website (and general issue tracker).
http://okfnlabs.org

[Post] Review of PDF data wrangling tools #155

Closed: rufuspollock closed 8 years ago

rufuspollock commented 10 years ago

Write a post reviewing data wrangling tools for PDFs

Text in progress at http://pad.okfn.org/p/labs-post-pdf-tools (inlined below)

Based on research material in https://github.com/okfn/ideas/issues/52

Questions:

Extracting data from PDFs unfortunately remains a common data wrangling task. This post reviews various tools and services for doing this, with a focus on free (and preferably open source) options.

3 categories:

The last case is really a situation for OCR (optical character recognition) so we're going to ignore it here. [should include a short para on OCR too, just to provide an indication of the limits of automated extraction without much pre-processing]

[[TODO: some nice PDF screenshots - perhaps we can reference]]

Generic (PDF -> text)

Google App Engine used to do this: http://developers.google.com/appengine/docs/python/conversion/overview

By Language

@maxogden has this list of Node libraries and tools:

https://gist.github.com/maxogden/5842859

Here's a gist showing how to use pdf2json: https://gist.github.com/rgrp/5944247

Other good intros

danfowler commented 9 years ago

This was covered here @tlevine, no? http://okfnlabs.org/blog/2013/12/25/parsing-pdfs.html

rufuspollock commented 9 years ago

@danfowler @tlevine's article was a great specific walkthrough - like a tutorial. This was more of a "complete" review of the tools out there so I think a bit different. We should finish the pad and then post ... :-)

tlevine commented 9 years ago

Indeed mine references only like half of the things you mention.

ScraperWiki have a proprietary PDF reader that they say is quite good.

andylolz commented 9 years ago

Good point – I’ve added that to the proprietary list.

rufuspollock commented 8 years ago

FIXED. http://okfnlabs.org/blog/2016/04/19/pdf-tools-extract-text-and-data-from-pdfs.html

tfmorris commented 8 years ago

That's a 404 for me.

Plus:

rufuspollock commented 8 years ago

@tfmorris it won't go live until tomorrow as per the date ;-) Check the commit if you want to review in advance ...

danfowler commented 8 years ago

Thanks. Online now. Had to give it a nudge to rebuild.