tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Detecting Text Data as Table while working with Java. #289

Open urmay opened 5 years ago

urmay commented 5 years ago

When I pass only the first page as a command-line argument, tabula correctly picks the table out of the surrounding text. But when I pass the whole document, it also detects plain text as a table. Version: tabula-1.0.2.jar

Java Code:

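import java.io.File;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;

import technology.tabula.CommandLineApp;

// The class name is arbitrary; it only wraps main() so the snippet compiles.
public class Issue289
{
    public static void main(String[] args)
    {
        // With all pages, plain text is also reported as tables:
        // String[] commandLineOptions = {"-p", "all", "-o", "$tsv"};
        // With only the first page, the table is detected correctly:
        String[] commandLineOptions = {"-p", "1", "-o", "$tsv"};

        CommandLineParser parser = new DefaultParser();
        try
        {
            // buildOptions() is a static method of CommandLineApp.
            CommandLine line = parser.parse(CommandLineApp.buildOptions(), commandLineOptions);
            new CommandLineApp(System.out, line).extractFileInto(
                    new File("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf"),
                    new File("C:/Users/path to pdf/ast_sci_data_tables_sample.tsv"));
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}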

Surprisingly, when I run the same file through the Python library, it detects only the tables across the whole PDF.

Python Code:

import tabula
from tabula import read_pdf
from tabula import convert_into

df = read_pdf("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", multiple_tables=True, pages="all")
convert_into("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", "test.json", output_format="json", multiple_tables=True, pages="all")

PDF file: http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf

satyaraj479 commented 4 years ago

Use the Nurminen detection algorithm (NurminenDetectionAlgorithm) to detect only the table regions, and then use BasicExtractionAlgorithm to extract the data from those regions into the required format, like CSV or HTML.
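
A minimal sketch of that suggestion, assuming tabula-java 1.0.2 with the PDFBox 2.x it depends on (PDFBox 3 loads documents differently). The class name DetectThenExtract and the use of args[0] as the PDF path are illustrative assumptions, not part of the original report. It runs detection on every page, crops the page to each detected region, and writes whatever the basic extractor finds there as CSV:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.detectors.NurminenDetectionAlgorithm;
import technology.tabula.extractors.BasicExtractionAlgorithm;
import technology.tabula.writers.CSVWriter;

// Hypothetical class name; pass the PDF path as the first argument.
public class DetectThenExtract {
    public static void main(String[] args) throws IOException {
        File pdf = new File(args[0]);

        // PDDocument.load(File) is the PDFBox 2.x entry point.
        try (PDDocument document = PDDocument.load(pdf)) {
            ObjectExtractor extractor = new ObjectExtractor(document);
            NurminenDetectionAlgorithm detector = new NurminenDetectionAlgorithm();
            BasicExtractionAlgorithm basic = new BasicExtractionAlgorithm();
            CSVWriter writer = new CSVWriter();

            PageIterator pages = extractor.extract();
            while (pages.hasNext()) {
                Page page = pages.next();

                // Step 1: find the rectangles that look like tables on this page.
                List<Rectangle> areas = detector.detect(page);

                for (Rectangle area : areas) {
                    // Step 2: crop the page to the detected region and extract from it only.
                    Page region = page.getArea(area);
                    List<? extends Table> tables = basic.extract(region);

                    // Step 3: write each extracted table as CSV to standard output.
                    for (Table table : tables) {
                        writer.write(System.out, table);
                    }
                }
            }
        }
    }
}
```

Pages where the detector returns no regions produce no output, which is the behaviour the original question is after. Detection is heuristic, so it may still miss a table or flag dense text on some documents.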