tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Facing difficulty in extracting unruled tables #197

Open ghost opened 6 years ago

ghost commented 6 years ago

I am working with the Tabula API and writing Java code to extract tables from arbitrary PDFs. I tried my code on several files, but I am only able to extract tables from a file that has ruled tables. I tried both SpreadsheetExtractionAlgorithm and BasicExtractionAlgorithm, but neither produced the desired results. I am sharing a sample PDF file in which I am unable to detect the tables, and the source code I have written in Java to extract the tables (the file is in .txt format because it cannot be attached here as a Java file). TableConverter.txt abc.pdf

criztovyl commented 6 years ago

Do you get the desired results when you use Tabula's integrated CLI?

ghost commented 6 years ago

@criztovyl Yes, I checked, but it does not always give the desired results.

criztovyl commented 6 years ago

"But I am only able to extract tables from a file that has ruled tables."

Tabula can only handle ruled tables; without them it can't do its job, so I think this is not an issue with Tabula.

I took a look at your Random Numbers file anyway, and the tables seem to be plain text separated by tabs and/or other whitespace. Now the question is what data you want: are the numbers themselves enough, or do you need them to be in a table? (I suspect you need a table, because Tabula is for tables.)

If you only need the numbers, you could use pdftotext (Debian has it in poppler-utils) and delete all the lines you do not need. If you need a table, then with a bit of scripting in your language of choice you should be able to construct tables from the extracted text.
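
As a rough illustration of that last step, here is a minimal Java sketch that turns already-extracted text into rows and cells. It is not part of tabula-java: the class name TextTableBuilder and the method parseRows are hypothetical, and it assumes the columns in the extracted text are separated by a tab or by two or more whitespace characters, which may not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of tabula-java or poppler.
class TextTableBuilder {

    // Assumption: a column boundary is a run of two or more whitespace
    // characters, or a single tab. A single space stays inside a cell.
    private static final String COLUMN_GAP = "\\s{2,}|\\t";

    // Splits extracted text into rows (one per non-blank line) and cells.
    static List<List<String>> parseRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split(COLUMN_GAP)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch, standing in for pdftotext output.
        String extracted = "12  34\t56\n\n  78   90  11\n";
        for (List<String> row : parseRows(extracted)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

If pdftotext is used to produce the input, running it in a layout-preserving mode should make the whitespace-gap assumption more likely to hold, but lines that are not part of a table still have to be filtered out by hand or by additional rules.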