okfn / messytables

Tools for parsing messy tabular data. This is now superseded by https://github.com/frictionlessdata/tabulator-py
http://messytables.readthedocs.io/

Support for PDF format #82

Closed fawkesley closed 11 years ago

fawkesley commented 11 years ago

We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here).

This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.
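For illustration, the optional-dependency behaviour described above could be sketched roughly as follows. This is not the code in this pull request: `module_available`, `HAVE_PDFTABLES`, and `PDFSupportTest` are hypothetical names, and the sketch assumes a modern Python 3 with `importlib.util` rather than the Python versions targeted at the time.

```python
import importlib.util
import unittest


def module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag: True only when the optional pdftables package is installed.
HAVE_PDFTABLES = module_available("pdftables")


@unittest.skipUnless(HAVE_PDFTABLES, "pdftables is not installed")
class PDFSupportTest(unittest.TestCase):
    """Placeholder PDF test case; skipped entirely when pdftables is missing."""

    def test_pdftables_importable(self):
        # The import happens inside the test, so merely loading this module
        # never fails on machines without the optional dependency.
        import pdftables
        self.assertIsNotNone(pdftables)
```

With this pattern, anyone who only needs CSV parsing never triggers the PDF import, and the test runner reports the PDF tests as skipped rather than failed.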

We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

rossjones commented 11 years ago

I think you need to add pdftables to the test requirements file, assuming it's on PyPI.

rossjones commented 11 years ago

Sorry you might need to rebase since I merged #81. I'm interested in @domoritz's opinion on this one :)

domoritz commented 11 years ago

My opinion is that you should never, ever change the history of something in the main repo (not even on a branch). It's better to create a new PR. However, I'm in favour of rebasing on external or private branches because it keeps the history cleaner.

rossjones commented 11 years ago

I meant opinion on the feature, not on rebasing on their private branch ;)

domoritz commented 11 years ago

Ahh. IMHO, parsing tables in PDFs is super difficult but would be really awesome. As long as someone who just wants simple CSV parsing does not have to install pdfminer and everything, I am for this feature.

@rossjones We talked about this before: I think we should move the requirements that are only important for certain features to a requirements.text file.

fawkesley commented 11 years ago

@domoritz Agreed on it being super difficult. We'll stick to this approach of PDF support being optional.

rossjones commented 11 years ago

I agree. As long as it is only the optional requirements rather than the core ones, I am all for it.

Also @paulfurley don't forget the changelog ;)

fawkesley commented 11 years ago

I'll get pdftables working on Python 2.6 now, and I'll give you a shout once I've rebased and updated the changelog :)

fawkesley commented 11 years ago

OK, tests are passing and the branch is rebased. I think we're good to go :) @rossjones