Open vishaln79 opened 6 years ago
The selected area is defined by coordinates (i.e. x1, x2, y1, and y2), while the page's width and height are also defined (I believe). On a side note: the one problem I have with this JSON object is that it carries redundant data when the selection is repeated across multiple pages (do the width and height of the pages ever differ within a document?).
I'm also interested in understanding how to convert the JSON information into command-line options.
For example, how should I convert this line:
{"page":1,"extraction_method":"guess","x1":1.1161425018310547,"x2":594.9039534759522,"y1":342.6557480621338,"y2":761.5812337493896,"width":593.7878109741212,"height":418.9254856872559}
in order to call the command-line interface with the -a parameter?
java -jar tabula-1.0.2.jar -b docs -p 1 -a ...
edit: it seems that the exported data needs to be transformed as follows:
-a 342.656,1.116,761.581,594.904
... meaning: y1,x1,y2,x2
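The reordering described in the edit above can be sketched in Java. This is only a minimal illustration and not part of Tabula: the class `TemplateAreaArg` and the method `toAreaArg` are hypothetical names, and the sketch assumes the four coordinates have already been read out of the exported JSON.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tabula: builds the value for the -a flag
// from the x1/x2/y1/y2 fields of an exported selection.
class TemplateAreaArg {

    // -a expects top,left,bottom,right, i.e. y1,x1,y2,x2.
    static String toAreaArg(double x1, double x2, double y1, double y2) {
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        return String.format(Locale.ROOT, "%.3f,%.3f,%.3f,%.3f", y1, x1, y2, x2);
    }

    public static void main(String[] args) {
        // Values copied from the exported JSON line in the question.
        String area = toAreaArg(1.1161425018310547, 594.9039534759522,
                                342.6557480621338, 761.5812337493896);
        System.out.println("-a " + area); // prints "-a 342.656,1.116,761.581,594.904"
    }
}
```

Rounding to three decimals matches the value shown in the edit; whether the CLI needs that rounding at all is not something this thread confirms.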
@fabcan Refer to CommandLineApp.java: it states that flag 'a' "Accepts top,left,bottom,right". In other words, it takes coordinates as y1,x1,y2,x2. You can further deduce its functionality by investigating the methods that are called when extraction is done via CLI. Specifically, I think the coordinates the user enters are properly mapped in this conditional:
if (pageAreas != null) {
    for (Pair<Integer, Rectangle> areaPair : pageAreas) {
        Rectangle area = areaPair.getRight();
        if (areaPair.getLeft() == RELATIVE_AREA_CALCULATION_MODE) {
            area = new Rectangle((float) (area.getTop() / 100 * page.getHeight()),
                    (float) (area.getLeft() / 100 * page.getWidth()),
                    (float) (area.getWidth() / 100 * page.getWidth()),
                    (float) (area.getHeight() / 100 * page.getHeight()));
        }
        tables.addAll(tableExtractor.extractTables(page.getArea(area)));
    }
} else {
    tables.addAll(tableExtractor.extractTables(page));
}
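The arithmetic in the RELATIVE_AREA_CALCULATION_MODE branch above reads as a percentage-to-absolute conversion: top and height are scaled by the page height, left and width by the page width. Here is a standalone sketch of just that arithmetic, assuming a 612 x 792 point page (US Letter) and using hypothetical names (`RelativeArea`, `toAbsolute`) rather than Tabula's own Rectangle and Page classes:

```java
import java.util.Arrays;

// Hypothetical standalone sketch of the percentage arithmetic in the quoted
// branch; it does not use Tabula's Rectangle or Page classes.
class RelativeArea {

    // Inputs are percentages of the page; the result is {top, left, width, height}
    // in the same units as pageWidth and pageHeight.
    static double[] toAbsolute(double topPct, double leftPct, double widthPct,
                               double heightPct, double pageWidth, double pageHeight) {
        return new double[] {
            topPct / 100 * pageHeight,    // vertical values scale with page height
            leftPct / 100 * pageWidth,    // horizontal values scale with page width
            widthPct / 100 * pageWidth,
            heightPct / 100 * pageHeight
        };
    }

    public static void main(String[] args) {
        // Assumed page size: 612 x 792 points (US Letter).
        double[] r = toAbsolute(25, 50, 25, 50, 612, 792);
        System.out.println(Arrays.toString(r)); // prints "[198.0, 306.0, 153.0, 396.0]"
    }
}
```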
In case anyone comes across this while trying to define the extraction area in Java code rather than on the command line, you can do this:
BasicExtractionAlgorithm extractionAlgorithm = new BasicExtractionAlgorithm();
List<Table> tables = extractionAlgorithm.extract(page.getArea(399.6f, 71.052f, 817.284f, 556.14f));
The getArea arguments are the top, left, bottom, and right coordinates. This thing really needs better documentation.
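If the coordinates come from an exported selection rather than being typed by hand, the same y1,x1,y2,x2 ordering should apply to getArea. The helper below is a hypothetical sketch (`TemplateGetAreaArgs` and `toGetAreaArgs` are not Tabula names) that only reorders the values and narrows them to float. The getArea call itself appears as a comment because it needs tabula-java on the classpath.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tabula: orders exported selection fields
// the way the getArea call in this comment expects them.
class TemplateGetAreaArgs {

    // Returns {top, left, bottom, right} = {y1, x1, y2, x2}, narrowed to float.
    static float[] toGetAreaArgs(double x1, double x2, double y1, double y2) {
        return new float[] { (float) y1, (float) x1, (float) y2, (float) x2 };
    }

    public static void main(String[] args) {
        // Illustrative values only, taken from the getArea call above.
        float[] a = toGetAreaArgs(71.052, 556.14, 399.6, 817.284);
        // With tabula-java available, these would feed the call shown above:
        // page.getArea(a[0], a[1], a[2], a[3]);
        System.out.println(Arrays.toString(a));
    }
}
```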
For example, I am trying to run a batch job based on a Tabula template I had created: {"page":5,"extraction_method":"guess","x1":306.4589953308106,"x2":548.1989953308106,"y1":177.0975,"y2":230.6475,"width":241.74,"height":53.550000000000004}
I am not sure how to specify the area based on this.
Thanks!