Detecting Text Data as Table while working with Java.

When I pass only the first page as a command-line argument (-p 1), Tabula detects the table on that page correctly. But when I pass the whole document (-p all), it also reports ordinary text as tables. Version: tabula-1.0.2.jar

Java Code:

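import java.io.File;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;

import technology.tabula.CommandLineApp;

// The class name is only a placeholder for this report.
public class TabulaTableTest
{
    public static void main(String[] args)
    {
        // Whole document: plain text is also reported as tables.
        // String[] commandLineOptions = {"-p", "all", "-o", "$tsv"};
        // First page only: the table is detected as expected.
        String[] commandLineOptions = {"-p", "1", "-o", "$tsv"};

        CommandLineParser parser = new DefaultParser();
        try
        {
            // buildOptions() is the static option builder on CommandLineApp.
            CommandLine line = parser.parse(CommandLineApp.buildOptions(), commandLineOptions);
            new CommandLineApp(System.out, line).extractFileInto(
                    new File("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf"),
                    new File("C:/Users/path to pdf/ast_sci_data_tables_sample.tsv"));
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}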

Surprisingly, when I run the same file through the Python library (tabula-py), it detects only the tables across the whole PDF.

Python Code:

from tabula import convert_into, read_pdf

# Read the tables from every page into DataFrames.
df = read_pdf("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", multiple_tables=True, pages="all")
# Run the same extraction again and write the result to a JSON file.
convert_into("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", "test.json", output_format="json", multiple_tables=True, pages="all")
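
A possible explanation, offered as an assumption rather than a confirmed diagnosis: read_pdf and convert_into in tabula-py default to guess=True, which adds the -g (--guess) flag when tabula-java is invoked, while the Java options above do not include that flag. The sketch below uses a hypothetical helper, build_tabula_args, only to make the difference between the two argument lists visible. It does not call either library.

```python
# Hypothetical helper (not part of tabula-py or tabula-java): builds the
# argument list that would be handed to the tabula-java command line.
def build_tabula_args(pages="1", guess=False, output_format=None):
    args = ["-p", str(pages)]
    if guess:
        # -g / --guess asks tabula-java to guess the table area on each page.
        args.append("-g")
    if output_format:
        # -f / --format selects the output format (CSV, TSV or JSON).
        args += ["-f", output_format]
    return args


# Same page selection as the Java code, without area guessing.
java_like = build_tabula_args(pages="all", output_format="TSV")

# Closer to what tabula-py is assumed to send by default (guess=True).
python_like = build_tabula_args(pages="all", guess=True, output_format="TSV")

print(java_like)    # ['-p', 'all', '-f', 'TSV']
print(python_like)  # ['-p', 'all', '-g', '-f', 'TSV']
```

If this assumption holds, adding "-g" to commandLineOptions in the Java code would be the first thing to try.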

PDF file: http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf

tabulapdf / tabula-java, issue #289