Gibberish output in tabula-java for Japanese PDF but works in Tabula

zwong commented 1 year ago

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish.

However, when using the standalone Tabula tool, the text is encoded properly.

After searching online, I tried the following, with no success:

Setting -Dfile.encoding=utf8
Setting chcp 65001
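
For readers wondering what kind of mismatch typically produces this sort of gibberish, here is a minimal sketch. It is an illustration only: the thread does not confirm which code page was involved, so the use of `cp1252` as the "wrong" decoder and the sample string are assumptions.

```python
# Illustration only: the thread does not say which code page was actually used.
text = "日本語"                     # sample Japanese text (assumed example)
utf8_bytes = text.encode("utf-8")   # bytes as a UTF-8 writer would emit them

# Assumption: the reader decodes with a legacy Windows code page (cp1252 here).
# errors="replace" keeps the demo from raising on bytes cp1252 does not define.
garbled = utf8_bytes.decode("cp1252", errors="replace")

# Decoding with the encoding that was really used restores the text.
restored = utf8_bytes.decode("utf-8")

print(ascii(garbled))      # escaped form of the mojibake, safe on any console
print(restored == text)    # True when the round trip is lossless
```

The bytes are identical in both cases; only the decoder differs. That is why settings such as `-Dfile.encoding` or `chcp` can matter, and also why they may not help if the wrong decoding happens in a different layer.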

I understand Tabula and tabula-java use the same library, but is there something different between the two that would explain the difference in output?

zwong commented 1 year ago

After further testing with CSV output, I found that the gibberish results only happen in tabula-py. tabula-java appears to write a properly encoded CSV. Closing this issue.

jeremybmerrill commented 1 year ago

@zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I no longer remember the wizardry needed to make the Windows command prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly.
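
One way to check whether the CSV itself is fine, independent of whatever program opens it, is to look at the raw bytes. The sketch below assumes the CSV is expected to be UTF-8; the helper name `decodes_as_utf8` and the file name in the usage note are hypothetical.

```python
from pathlib import Path


def decodes_as_utf8(path):
    """Return True if the file's raw bytes form valid UTF-8."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")  # strict decoding: raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True
```

For example, `decodes_as_utf8("output.csv")` returning `True` suggests the file is intact and the viewer is guessing the encoding wrong. It is not proof that the content is right, since pure ASCII and some mis-encoded data also pass this check.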

zwong commented 1 year ago

Thank you. To work around the issue in tabula-py, I ended up doing something similar to what you suggested: I output a CSV and then read it into Python. The encoding is correct, and this sidesteps a lot of the issues I faced when trying to import the PDF data directly. With Excel, I learned that I had to set the encoding explicitly; otherwise it would read the data as ANSI (I'm guessing). The next issue is importing the data into the correct columns so I can start processing it!
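
The workaround described above might look roughly like the sketch below. It is not the exact code used in this thread. It assumes tabula-py is installed with a working Java runtime, that the CSV written by tabula-java is UTF-8, and the helper names (`pdf_to_csv`, `read_rows`, `write_rows_for_excel`) are made up for illustration.

```python
import csv


def pdf_to_csv(pdf_path, csv_path):
    """Have tabula-py write the extracted tables straight to a CSV file."""
    import tabula  # tabula-py; imported lazily because it needs Java at run time

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")


def read_rows(csv_path, encoding="utf-8"):
    """Read the CSV with an explicit encoding, not the platform default."""
    with open(csv_path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


def write_rows_for_excel(rows, csv_path):
    """Write UTF-8 with a byte order mark, which Excel generally detects."""
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)
```

Passing `encoding` explicitly matters on Windows, where Python's default text encoding for `open` is often a legacy code page. The `utf-8-sig` variant is only needed for a copy that will be opened by double-clicking in Excel.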

tabulapdf / tabula-java #513