tabulapdf / tabula-java

Extract tables from PDF files
MIT License

large chunks of table missed; cols 2&3 merged in some rows #483

Open sunsetsandplants opened 2 years ago

sunsetsandplants commented 2 years ago

Hi!

I'm trying to use tabula to extract the data from a PDF file that consists of one large 3-column table. There are 440 pages with ~50 rows each, so perhaps a little under 22,000 rows altogether.

The main problem is that tabula misses large chunks of the table, skipping whole pages of the PDF. I end up with only around 1/3 of the expected number of rows.
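
A rough way to quantify the shortfall is to count the records in the extracted CSV and compare that with the expected total (440 pages × ~50 rows ≈ 22,000). The snippet below is only a sketch: it assumes the output is plain CSV with one table row per record, and the helper names `count_rows` and `coverage` are made up for illustration.

```python
import csv
import io

PAGES = 440          # page count given in the description
ROWS_PER_PAGE = 50   # approximate rows per page given in the description


def count_rows(csv_text):
    """Count records in CSV text that contain at least one non-blank cell."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(1 for row in reader if any(cell.strip() for cell in row))


def coverage(extracted_rows, pages=PAGES, rows_per_page=ROWS_PER_PAGE):
    """Fraction of the expected rows that actually made it into the output."""
    return extracted_rows / (pages * rows_per_page)
```

To apply this to a real file, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `count_rows`. A coverage near 0.33 would match the "around 1/3" estimate.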

There are also problems with some rows (maybe about 2%), where the third-column entry is merged into the end of the second column. I think this happens when an entry in the second column of the original table was too wide for the space available on the A4 page and presumably overlapped the third column. (It would be nice not to have to correct these by hand, but it's not a make-or-break issue for me.)
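
The merged rows could at least be located automatically rather than by eye. The sketch below assumes that a merged row shows up in the CSV with a missing or empty third field, which may not hold for every output. `find_merged_rows` is a hypothetical helper, not part of tabula.

```python
import csv
import io


def find_merged_rows(csv_text, expected_cols=3):
    """Return (record_number, row) pairs whose last expected column is missing or empty."""
    suspects = []
    reader = csv.reader(io.StringIO(csv_text))
    for record_no, row in enumerate(reader, start=1):
        if not any(cell.strip() for cell in row):
            continue  # ignore completely blank records
        # Pad short rows so the last expected column can always be indexed.
        padded = row + [""] * (expected_cols - len(row))
        if not padded[expected_cols - 1].strip():
            suspects.append((record_no, row))
    return suspects
```

This only flags the suspect rows. Splitting the merged cell back into two values depends on what the third column looks like in the actual data, so that step is left out.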

Any help or pointers would be extremely welcome.

I attach the PDF and the output I get:

tabula-Qualis novos - Julho de 2019.csv

Qualis novos - Julho de 2019.pdf