tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Last rows of table content not extracted. #219

Open micklegill opened 6 years ago

micklegill commented 6 years ago

When extracting a table in lattice mode, the last rows of the table are not detected. I am attaching my PDF file along with the command I used.

Command used: java -jar tabula-1.0.1-jar-with-dependencies.jar -l -p 2 Tables.pdf -o t.csv

Tables.pdf

sandeepsharma-kgp commented 4 years ago

I'm getting the same issue. Is it going to be resolved any time soon?

hs-neax commented 3 years ago

I have the same issue. In code, extending the Rectangle's height slightly seems to "fix" it (I'm using BottomMargin=2):

// Assumes extractor is an ObjectExtractor opened on the PDF and BottomMargin = 2.
NurminenDetectionAlgorithm nda = new NurminenDetectionAlgorithm();
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
PageIterator pages = extractor.extract();
List<Table> tables = new ArrayList<Table>();
while (pages.hasNext()) {
    Page page = pages.next();
    // Detect candidate table areas on this page.
    List<Rectangle> areas = nda.detect(page);
    for (Rectangle a : areas) {
        // FIXME: Extend Rectangle by 2pt down to read last row
        a.setBottom(a.getBottom() + BottomMargin);
        Page sub_page = page.getArea(a);
        tables.addAll(sea.extract(sub_page));
    }
}
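
For anyone who wants to try the same idea without running off the page, here is a minimal standalone sketch of the geometry only. It assumes, as I understand it, that tabula's Rectangle extends java.awt.geom.Rectangle2D.Float with y growing downward, so moving the bottom edge down means increasing the height. The class name BottomMarginSketch, the extendBottom helper and the clamp to the page height are my own illustrative additions, not part of tabula-java or of the workaround above.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper for illustration; not part of tabula-java.
class BottomMarginSketch {

    // Returns a copy of `area` whose bottom edge is moved down by `margin`
    // points, but never past `pageHeight` (the bottom of the page).
    static Rectangle2D.Float extendBottom(Rectangle2D.Float area, float margin, float pageHeight) {
        float bottom = area.y + area.height;
        // Clamp the widened bottom edge to the page boundary.
        float newBottom = Math.min(bottom + margin, pageHeight);
        // Never shrink the area if it already reaches past the page.
        newBottom = Math.max(newBottom, bottom);
        return new Rectangle2D.Float(area.x, area.y, area.width, newBottom - area.y);
    }

    public static void main(String[] args) {
        // Example values are made up: a detected area on a 792pt-high page.
        Rectangle2D.Float detected = new Rectangle2D.Float(50f, 600f, 400f, 150f);
        Rectangle2D.Float widened = extendBottom(detected, 2f, 792f);
        System.out.println("bottom before: " + (detected.y + detected.height));
        System.out.println("bottom after:  " + (widened.y + widened.height));
    }
}
```

In the loop above, the equivalent would be to cap the new bottom at the page's height before calling setBottom. Whether 2pt is enough probably depends on how far the last ruling line sits from the detected area in a given PDF.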