tabulapdf / tabula-java

Extract tables from PDF files
MIT License

How are folks here specifying the areas to extract in tabula-java? #248

Open vishaln79 opened 6 years ago

vishaln79 commented 6 years ago

For example, I am trying to run a batch job based on a Tabula template I had created:

{"page":5,"extraction_method":"guess","x1":306.4589953308106,"x2":548.1989953308106,"y1":177.0975,"y2":230.6475,"width":241.74,"height":53.550000000000004}

I am not sure how to specify the area based on this.

Thanks!

rosenjcb commented 6 years ago

The selection area is defined by its coordinates (i.e. x1, x2, y1, and y2), while width and height give the page's dimensions (I believe). On a side note: the one problem I have with this JSON object is that it carries redundant data when you repeat the selection across multiple pages (do the width and height of the pages ever differ within a document?).

ghost commented 6 years ago

I'm also interested in understanding how to convert the JSON information into command-line options. For example, how should I convert this line:

{"page":1,"extraction_method":"guess","x1":1.1161425018310547,"x2":594.9039534759522,"y1":342.6557480621338,"y2":761.5812337493896,"width":593.7878109741212,"height":418.9254856872559}

in order to call the command-line interface with the -a parameter?

java -jar tabula-1.0.2.jar -b docs -p 1 -a ...

edit: it seems that the exported data needs to be transformed as follows: -a 342.656,1.116,761.581,594.904 ... meaning: y1,x1,y2,x2
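The reordering described in the edit above can be sketched in a few lines of Java. This is only an illustration: the class name TemplateArea and the method toAreaArgument are made up for this example and are not part of tabula-java, and the three-decimal rounding simply mirrors the values quoted in the edit. In both templates quoted in this thread, width and height equal x2 - x1 and y2 - y1, so they appear to describe the selection itself and are not needed to build the -a value.

```java
import java.util.Locale;

// Hypothetical helper, not part of tabula-java.
class TemplateArea {
    /** Formats template coordinates as top,left,bottom,right, i.e. y1,x1,y2,x2. */
    static String toAreaArgument(double x1, double x2, double y1, double y2) {
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        return String.format(Locale.ROOT, "%.3f,%.3f,%.3f,%.3f", y1, x1, y2, x2);
    }

    public static void main(String[] args) {
        // Values copied from the page 1 template quoted above.
        String area = toAreaArgument(1.1161425018310547, 594.9039534759522,
                                     342.6557480621338, 761.5812337493896);
        System.out.println("-a " + area); // prints: -a 342.656,1.116,761.581,594.904
    }
}
```

The printed value matches the one given in the edit, so it can be pasted after -p 1 in the command line shown in the question.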

rosenjcb commented 6 years ago

@fabcan Refer to CommandLineApp.java: it states that flag 'a' "Accepts top,left,bottom,right". In other words, it takes coordinates as y1,x1,y2,x2. You can further deduce its functionality by investigating the methods that are called when extraction is done via CLI. Specifically, I think the coordinates the user enters are properly mapped in this conditional:

if (pageAreas != null) {
    for (Pair<Integer, Rectangle> areaPair : pageAreas) {
        Rectangle area = areaPair.getRight();
        if (areaPair.getLeft() == RELATIVE_AREA_CALCULATION_MODE) {
            area = new Rectangle(
                    (float) (area.getTop() / 100 * page.getHeight()),
                    (float) (area.getLeft() / 100 * page.getWidth()),
                    (float) (area.getWidth() / 100 * page.getWidth()),
                    (float) (area.getHeight() / 100 * page.getHeight()));
        }
        tables.addAll(tableExtractor.extractTables(page.getArea(area)));
    }
} else {
    tables.addAll(tableExtractor.extractTables(page));
}
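To make the relative branch of that conditional concrete, here is a standalone sketch of the same arithmetic using plain arrays instead of tabula-java's Rectangle and Page. The class RelativeArea, the method toAbsolute, and the 612 x 792 page size are assumptions chosen for illustration; only the four scaling expressions are taken from the snippet above.

```java
import java.util.Arrays;

// Hypothetical stand-in for the relative-area branch; not tabula-java code.
class RelativeArea {
    /** Scales {top, left, width, height} given in percent to page units. */
    static float[] toAbsolute(float[] pct, float pageWidth, float pageHeight) {
        return new float[] {
            pct[0] / 100 * pageHeight, // top scales with page height
            pct[1] / 100 * pageWidth,  // left scales with page width
            pct[2] / 100 * pageWidth,  // width scales with page width
            pct[3] / 100 * pageHeight  // height scales with page height
        };
    }

    public static void main(String[] args) {
        // Assumed page of 612 x 792 points; select the bottom half of the page.
        float[] abs = toAbsolute(new float[] {50f, 0f, 100f, 50f}, 612f, 792f);
        System.out.println(Arrays.toString(abs)); // prints: [396.0, 0.0, 612.0, 396.0]
    }
}
```

In the non-relative case the snippet passes the rectangle to page.getArea unchanged, so no scaling step is involved.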

KalebTeixeira commented 2 years ago

In case anyone comes across this while trying to define the extraction area in Java code rather than on the command line, you can do this:

BasicExtractionAlgorithm extractionAlgorithm = new BasicExtractionAlgorithm();
List<Table> tables = extractionAlgorithm.extract(page.getArea(399.6f, 71.052f, 817.284f, 556.14f));

The getArea arguments are the top, left, bottom, and right coordinates. This thing really needs better documentation.