Closed: zwong closed this issue 1 year ago
After further testing with output to CSV, I found that the gibberish results only happen in tabula-py. tabula-java appears to write a CSV that is properly encoded. Closing this issue.
@zwong glad you figured it out. The combination of Windows, Command Prompt, Java and tabula-py is a complicated one! I no longer remember the wizardry needed to make the Windows command prompt cooperate. Have you tried having tabula-py output a CSV? I wonder if the CSV is correct, but the program you use to open it (e.g. Excel) is guessing the encoding incorrectly?
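To illustrate the "wrong guess" scenario, here is a minimal sketch of how correctly encoded text turns into gibberish when a reader assumes the wrong encoding. It assumes the file is UTF-8 and the reader guesses Windows-1252 ("ANSI" on many Western Windows setups); neither is confirmed in this thread, and `simulate_wrong_decoding` is a hypothetical helper, not part of tabula-py.

```python
def simulate_wrong_decoding(text, actual="utf-8", guessed="cp1252"):
    """Encode text as `actual`, then decode the bytes as `guessed`.

    This mimics a program opening a file with the wrong encoding.
    Both encoding names are assumptions chosen for illustration.
    """
    return text.encode(actual).decode(guessed, errors="replace")


original = "日本語"
garbled = simulate_wrong_decoding(original)

# The bytes on disk were never damaged: reversing the wrong guess
# recovers the original text (this works here because every byte of
# this particular string has a cp1252 mapping).
recovered = garbled.encode("cp1252").decode("utf-8")

print(repr(garbled))
print(repr(recovered))
```

If the output looks like this kind of gibberish, the data itself is probably fine and only the decoding step needs an explicit encoding.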
Thank you. To work around the issue in tabula-py, I ended up doing something similar to what you suggested: I output a CSV and read it into Python. The encoding is correct, and it sidesteps a lot of the issues I faced when trying to import the PDF data directly. With Excel, I learned that I had to set the encoding explicitly; otherwise it would read the data as ANSI (I'm guessing). The next issue is importing the data into the correct columns so I can start processing it!
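For anyone landing here later, the workaround described above might look roughly like the sketch below. It assumes the CSV written by tabula-java is UTF-8 (the thread only says the encoding is "correct"), and the helper names `read_csv_rows`, `write_csv_for_excel` and `pdf_to_rows` are hypothetical. `tabula.convert_into` is the tabula-py call that writes the extracted tables to a file; it needs Java installed.

```python
import csv


def read_csv_rows(csv_path, encoding="utf-8"):
    """Read a CSV into a list of rows, with the encoding stated explicitly."""
    with open(csv_path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


def write_csv_for_excel(rows, csv_path):
    """Write rows as UTF-8 with a BOM so Excel detects the encoding."""
    # "utf-8-sig" prepends the byte order mark that Excel looks for.
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


def pdf_to_rows(pdf_path, csv_path, encoding="utf-8"):
    """Have tabula-py write a CSV, then read it back in Python.

    Assumption: the CSV on disk is UTF-8; pass another encoding if not.
    """
    import tabula  # imported here so the CSV helpers work without Java

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")
    return read_csv_rows(csv_path, encoding=encoding)
```

Usage would be something like `rows = pdf_to_rows("input.pdf", "output.csv")`, where both file names are placeholders. Re-saving with `write_csv_for_excel` is optional and only matters if the file will be opened by double-clicking in Excel.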
I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish.
However, when using the standalone Tabula tool, the text is encoded properly:
Searching online, I've tried the below with no success.
I understand Tabula and tabula-java use the same library, but is there something different between the two that would explain the difference in output?