tabulapdf / tabula-java

Extract tables from PDF files
MIT License

RFC: Exposing an API for templated extraction #10

Open jazzido opened 9 years ago

jazzido commented 9 years ago

Extracting tables with a predefined template or stencil is a frequently requested feature for Tabula. Some use cases:

I've implemented (4439b57e8e19b92792b577747ad1551144ad8ec7) a new method in SpreadsheetExtractionAlgorithm that, instead of building sets of cells ("spreadsheets") from the ruling lines contained in a Page, takes a List<Ruling> as a parameter. That would allow us to expose a feature in the command line tool and in an HTTP API that takes a structure such as:

{ "page": 1,
  "regions": [
    {
      "name": "table_foo",
      "coords": { "x1": ..., "y1": ..., "width": ..., "height": ... },
      "rulings": [ {"x1": 320.0, "y1": 285.0, "x2": 564.4409, "y2": 285.0}, .... ]
    }
  ]
}

In the context of Tabula, we would add a rulings key to the extraction parameters that the client sends to the server, carrying the data about the separators. In the context of the command line application, it could accept a JSON file with the specification.
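To make the shape of that structure concrete, here is a minimal Java sketch of how the template could be modelled in memory. The `Template`, `Region`, `Coords` and `Ruling` records are hypothetical stand-ins that mirror the JSON keys; they are not tabula-java classes, and the `isConsistent` check (every ruling endpoint must fall inside the region's `coords`, within a small tolerance) is an assumed validation rule, not something the proposal specifies. The width and height values are made up for illustration.

```java
import java.util.List;

public class ExtractionTemplate {
    // Hypothetical types mirroring the JSON keys; not part of tabula-java.
    record Ruling(double x1, double y1, double x2, double y2) {}

    record Coords(double x1, double y1, double width, double height) {
        // True if the point lies inside the region, allowing a small tolerance
        // for floating-point noise in the coordinates.
        boolean contains(double x, double y) {
            double eps = 0.5;
            return x >= x1 - eps && x <= x1 + width + eps
                && y >= y1 - eps && y <= y1 + height + eps;
        }
    }

    record Region(String name, Coords coords, List<Ruling> rulings) {
        // Assumed validation rule: both endpoints of every ruling are inside coords.
        boolean isConsistent() {
            return rulings.stream().allMatch(r ->
                coords.contains(r.x1(), r.y1()) && coords.contains(r.x2(), r.y2()));
        }
    }

    record Template(int page, List<Region> regions) {}

    public static void main(String[] args) {
        // Illustrative values only: the width and height are invented for this example.
        Region region = new Region(
            "table_foo",
            new Coords(320.0, 285.0, 244.4409, 100.0),
            List.of(new Ruling(320.0, 285.0, 564.4409, 285.0)));
        Template template = new Template(1, List.of(region));
        for (Region r : template.regions()) {
            System.out.println(r.name() + " consistent: " + r.isConsistent());
        }
    }
}
```

A JSON library of your choice could deserialize the specification file into these records; the sketch leaves that part out to stay dependency-free.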

Having a GUI for this feature would be awesome. There's a tabula-table-editor that I started last year, that could be integrated into Tabula for adjusting the detected tabular structures: http://dump.jazzido.com/tabula-table-editor/

Additionally, the BasicExtractionAlgorithm can use this new method as its extraction backend, by building a List<Ruling> from the detected lines and columns.
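As a rough illustration of that idea, the sketch below turns a set of column and row boundaries into a list of rulings that form a closed grid. It assumes the boundaries are already known and sorted in ascending order; the `Ruling` record is a hypothetical stand-in for tabula-java's class, and `fromBoundaries` is an invented helper, not an existing method.

```java
import java.util.ArrayList;
import java.util.List;

public class GridRulings {
    // Hypothetical stand-in for tabula-java's Ruling: a segment from (x1, y1) to (x2, y2).
    record Ruling(double x1, double y1, double x2, double y2) {}

    // Builds a closed grid: one horizontal ruling per row boundary and one
    // vertical ruling per column boundary, each spanning the whole region.
    // Both arrays are assumed to be sorted in ascending order.
    static List<Ruling> fromBoundaries(double[] columnXs, double[] rowYs) {
        if (columnXs.length < 2 || rowYs.length < 2) {
            throw new IllegalArgumentException(
                "need at least two column and two row boundaries");
        }
        double left = columnXs[0], right = columnXs[columnXs.length - 1];
        double top = rowYs[0], bottom = rowYs[rowYs.length - 1];

        List<Ruling> rulings = new ArrayList<>();
        for (double y : rowYs) {
            rulings.add(new Ruling(left, y, right, y));   // horizontal separator
        }
        for (double x : columnXs) {
            rulings.add(new Ruling(x, top, x, bottom));   // vertical separator
        }
        return rulings;
    }

    public static void main(String[] args) {
        // Two columns and two rows: three boundaries on each axis.
        List<Ruling> rulings = fromBoundaries(
            new double[] {320.0, 440.0, 564.4409},
            new double[] {285.0, 300.0, 315.0});
        rulings.forEach(System.out::println);
    }
}
```

The resulting list has the same shape as the `rulings` array in the JSON structure, so the same extraction path could serve both detected and user-supplied separators.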

@jeremybmerrill, @mtigas: Would love to hear your thoughts about this.

Cheers!

jeremybmerrill commented 9 years ago

This is awesome and I'm excited about getting it built. I think we talked about this with @floodfish during one of our meetings a while ago and figured that we're at a stage where this much-requested feature is doable. We could even make it possible to "save" a "template"...

That said, I want to get this feature/newUI version shipped ASAP. There are, I think, a lot of improvements that I have personally been using for MONTHS that our users deserve to have.

jazzido commented 9 years ago

Absolutely, no need to wait for this to release newUI. If anything, this can progress in parallel, and we can have it as a command line thing for those who want to use tabula-java.

jeremybmerrill commented 9 years ago

👍👍


floodfish commented 9 years ago

Yeah, I think we even discussed saving that feature for later. It definitely shouldn't hold up the launch of the good stuff we have.

paulohpcardoso commented 9 years ago

Hello, can anyone tell me whether there has been any progress on this table editor? http://dump.jazzido.com/tabula-table-editor/ How can I contribute to moving it forward? Thanks.

jazzido commented 9 years ago

Hi @paulohpcardoso,

It would be absolutely fantastic if you could help us finish and integrate the table editor. The link that you mentioned contains the code from the table_editor branch of the tabula_table_editor repo. I spent quite a bit of time working on that more than a year ago, but we never had the time to fully integrate it with Tabula.

The purpose of that tool is to generate a set of ruling lines that would be passed as an argument to public List<? extends Table> extract(Page page, List<Ruling> rulings) in the SpreadsheetExtractionAlgorithm class.

If you want to attempt that, I would be more than happy to help you navigate the code of both the table_editor tool and tabula-java, and then we could work on integrating it with the Tabula tool.

Thanks for your interest in this!