tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Gibberish output in tabula-java for Japanese PDF but works in Tabula #513

Closed zwong closed 1 year ago

zwong commented 1 year ago

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish: [image]

However, the standalone Tabula tool produces correctly encoded output: [image]

After searching online, I tried the following, with no success:

  1. Passing -Dfile.encoding=utf8 to Java
  2. Running chcp 65001 in the Command Prompt

I understand Tabula and tabula-java use the same library, but is there something different between the two that would explain the difference in output?

zwong commented 1 year ago

After further testing with CSV output, I found that the gibberish results only happen in tabula-py. tabula-java appears to produce a properly encoded CSV. Closing this issue.

jeremybmerrill commented 1 year ago

@zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I don't really remember anymore the wizardry needed to make the Windows command prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly?
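
To illustrate the "wrong guess" hypothesis, here is a minimal Python sketch using only the standard library. The sample string is made up, and cp1252 is only an assumed stand-in for whatever "ANSI" code page the viewing program falls back to. The bytes are identical in both cases; only the decoding step differs.

```python
# Made-up sample text standing in for a cell extracted from the PDF.
text = "東京都"

# The bytes a correctly encoded UTF-8 CSV would contain for that cell.
utf8_bytes = text.encode("utf-8")

# Decoding with the matching encoding round-trips cleanly.
decoded_ok = utf8_bytes.decode("utf-8")

# Decoding the same bytes with a legacy code page (assumed: cp1252) produces
# gibberish. errors="replace" is needed because some of these bytes have no
# mapping in cp1252 at all.
decoded_wrong = utf8_bytes.decode("cp1252", errors="replace")

print(decoded_ok)
print(decoded_wrong)
```

If the gibberish seen on screen looks like the second line, the file itself is probably fine and the reader's encoding setting is the thing to change.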

zwong commented 1 year ago

Thank you. To work around the issue in tabula-py, I ended up doing something similar to what you suggested: I output a CSV and then read it into Python. The encoding is correct, and it sidesteps a lot of the issues I faced when trying to import the PDF data directly. With Excel, I learned that I had to set the encoding explicitly; otherwise it would just read the data as ANSI (I'm guessing). The next issue is importing the data into the correct columns so I can start processing it!
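
The workaround described above can be sketched roughly as follows, using only the Python standard library. The rows and file name are made up, and the CSV is written by the script itself as a stand-in for the file tabula-java would produce. The sketch uses Python's "utf-8-sig" codec, which prepends a byte order mark on write and strips one, if present, on read; Excel commonly treats that mark as a signal that the file is UTF-8, though that behavior is an assumption here and may vary by version.

```python
import csv
import os
import tempfile

# Made-up rows standing in for a table extracted from the Japanese PDF.
rows = [["都道府県", "人口"], ["東京都", "100"]]


def write_csv(path, rows, encoding="utf-8-sig"):
    """Write rows as CSV; utf-8-sig adds a BOM at the start of the file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path, encoding="utf-8-sig"):
    """Read a CSV with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "tables.csv")
write_csv(path, rows)
loaded = read_csv(path)
print(loaded)
```

Passing the encoding explicitly matters mostly on Windows, where the default used by `open()` can be a legacy code page rather than UTF-8.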