When I pass only the first page as the command-line argument (-p 1), tabula correctly picks the table out of the text on that page. But when I pass the whole document (-p all), it also detects ordinary text as tables.
Version: tabula-1.0.2.jar
Java Code:
public static void main(String[] args) throws ParseException
{
    // String[] commandLineOptions = {"-p", "all", "-o", "$tsv"};
    String[] commandLineOptions = {"-p", "1", "-o", "$tsv"};
    CommandLineParser parser = new DefaultParser();
    try
    {
        CommandLine line = parser.parse(buildOptions(), commandLineOptions);
        new CommandLineApp(System.out, line).extractFileInto(
            new File("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf"),
            new File("C:/Users/path to pdf/ast_sci_data_tables_sample.tsv"));
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Surprisingly, when I run the same file through the Python library, it detects only the tables across the whole PDF.
Python Code:
from tabula import read_pdf
from tabula import convert_into

pdf_path = "C:/Users/path to pdf/ast_sci_data_tables_sample.pdf"
df = read_pdf(pdf_path, multiple_tables=True, pages="all")
convert_into(pdf_path, "test.json", output_format="json", multiple_tables=True, pages="all")
Use the Nurminen detection algorithm (NurminenDetectionAlgorithm) to detect only the tables, and then use BasicExtractionAlgorithm to extract the data into the required format, such as CSV or HTML.
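A minimal sketch of that detect-then-extract approach follows. It assumes tabula-java 1.0.x on top of PDFBox 2.x (where PDDocument.load(File) is available); the class name TableOnlyExtractor and the helper csvPathFor are hypothetical names introduced only for illustration, and the CSV output is just one possible format.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.detectors.NurminenDetectionAlgorithm;
import technology.tabula.extractors.BasicExtractionAlgorithm;
import technology.tabula.writers.CSVWriter;

public class TableOnlyExtractor {

    // Hypothetical helper: derive an output path by swapping .pdf for .csv.
    static String csvPathFor(String pdfPath) {
        return pdfPath.replaceAll("(?i)\\.pdf$", "") + ".csv";
    }

    // Detect table regions on every page, then extract only those regions as CSV text.
    static String extractTablesAsCsv(File pdf) throws IOException {
        NurminenDetectionAlgorithm detector = new NurminenDetectionAlgorithm();
        BasicExtractionAlgorithm extractor = new BasicExtractionAlgorithm();
        CSVWriter writer = new CSVWriter();
        StringBuilder out = new StringBuilder();

        // Assumes PDFBox 2.x; closing the document is handled by try-with-resources.
        try (PDDocument document = PDDocument.load(pdf)) {
            PageIterator pages = new ObjectExtractor(document).extract();
            while (pages.hasNext()) {
                Page page = pages.next();
                // Each detected rectangle is a candidate table area on this page.
                for (Rectangle area : detector.detect(page)) {
                    // Restrict extraction to the detected area so body text is skipped.
                    List<Table> tables = extractor.extract(page.getArea(area));
                    writer.write(out, tables);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TableOnlyExtractor <path-to-pdf>
        System.out.print(extractTablesAsCsv(new File(args[0])));
    }
}
```

Because extraction is limited to the rectangles returned by the detector, paragraphs outside those areas should no longer be emitted as table rows. Detection is heuristic, so the result for a given PDF still needs to be checked by eye.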
PDF file: http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf