Hi!
I'm trying to use tabula to extract the data from a PDF file consisting of one large 3-column table. There are 440 pages of ~50 rows each, so a little under 22,000 rows altogether.
The main problem is that tabula misses large chunks of the table, skipping whole pages of the PDF. I end up with only about 1/3 as many table rows as expected.
There are also problems with some rows (roughly 2%, at a guess) where the third-column entry is merged into the end of the second column. I think this happens when an entry in the second column of the original table was too wide for the space available on the A4 page and presumably 'overlapped' the third column in the original. (It would be nice not to have to correct these by hand, but it's not a make-or-break issue for me.)
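In case it helps anyone reproduce the numbers, here is a minimal sketch of how the row counts could be checked against the exported CSV. It assumes (and this is only my guess) that a merged row shows up with an empty or missing third field; `summarize_rows` and the sample text are made up for illustration and are not part of tabula.

```python
import csv
import io


def summarize_rows(csv_text, expected_rows, n_cols=3):
    """Count extracted rows and flag rows whose last column is empty or missing."""
    # Skip completely blank lines so they do not inflate the row count.
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if any(c.strip() for c in r)]
    # Assumption: a row whose third column got merged into the second
    # ends up with an empty (or absent) third field in the CSV.
    suspect = [r for r in rows if len(r) < n_cols or not r[n_cols - 1].strip()]
    return {
        "rows": len(rows),
        "fraction_of_expected": len(rows) / expected_rows,
        "suspect_merged": len(suspect),
    }


# Hypothetical sample standing in for the real export.
sample = "a,b,c\nd,e f,\ng,h,i\n"
print(summarize_rows(sample, expected_rows=440 * 50))
```

On the real file, the CSV text would be read from disk and passed in the same way; `440 * 50` is just the rough page-times-rows estimate from above.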
Any help or pointers would be extremely welcome.
I attach the PDF and the output I get:
tabula-Qualis novos - Julho de 2019.csv
Qualis novos - Julho de 2019.pdf