tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Gibberish in output #489

Open kevinburke opened 2 years ago

kevinburke commented 2 years ago

I'm using Tabula for Mac. We are trying to export the tables in the attached PDF (concord_housing_table.pdf).

The initial upload generated a lot of overlapping selections. We removed all of them except for the selections that covered the entire table row.

When we go to export, the output looks like complete gibberish:

[Screenshot: Export Data | Tabula, 2022-08-15 11-07-17]

We're confused about this, because clearly it's meaningful gibberish: the number of gibberish characters corresponds to the text in the original file. Maybe we missed an encoding setting? We looked through the tools in the app but didn't find anything that helped.

jeremybmerrill commented 2 years ago

Hi @kevinburke nice to see you here :)

This is almost certainly an issue in how PDFBox, the library Tabula uses to interact with the PDF at a low level, handles PDFs generated in unusual ways. The best fix is to re-encode the PDF with pdftk, Acrobat, or a tool of your choice. That generally fixes things.

jazzido commented 2 years ago

It could also be a subsetted font, which is essentially a non-standard encoding. See this StackOverflow answer.
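
To illustrate why a subsetted font could produce "meaningful gibberish" of exactly the right length, here is a minimal, self-contained Java sketch. It is a toy simulation, not Tabula's or PDFBox's actual code: the class `SubsetFontDemo` and its methods are hypothetical names, and the assumption is that the font assigns its own character codes in order of first use and that the PDF carries no mapping back to Unicode, so an extractor falls back to reading the codes as ASCII.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SubsetFontDemo {

    // Hypothetical subsetter: gives each distinct character a private code,
    // in order of first use, starting at 0x21 ('!' in ASCII).
    static Map<Character, Integer> buildSubsetEncoding(String text) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        int next = 0x21;
        for (char c : text.toCharArray()) {
            if (!codes.containsKey(c)) {
                codes.put(c, next++);
            }
        }
        return codes;
    }

    // What ends up in the page content: one private code per character.
    static int[] encode(String text, Map<Character, Integer> codes) {
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = codes.get(text.charAt(i));
        }
        return out;
    }

    // Naive extractor (assumption): with no code-to-Unicode mapping available,
    // each private code is simply interpreted as an ASCII character.
    static String naiveExtract(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    // Convenience wrapper: encode with a subset encoding, then extract naively.
    static String simulateExtraction(String text) {
        return naiveExtract(encode(text, buildSubsetEncoding(text)));
    }

    public static void main(String[] args) {
        String original = "Concord housing";
        String extracted = simulateExtraction(original);
        System.out.println(original + " (" + original.length() + " chars)");
        System.out.println(extracted + " (" + extracted.length() + " chars)");
    }
}
```

The extracted string has the same length as the original, and repeated letters map to the same repeated symbol, which matches the symptom described above: the glyphs render correctly on screen, but the underlying codes mean nothing without the font's own mapping.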