tabulapdf / tabula-java

Extract tables from PDF files

Unable to Process Images #495

Closed: beerdedfellow closed this issue 1 year ago

beerdedfellow commented 2 years ago

Hello,

I am currently attempting to use tabula-py to process PDF files that contain an image. However, whenever I attempt to read the document with tabula.read_pdf, I get an error similar to the following:

(3.9.10/envs/alert_logic) jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/github/Alert-Logic-Billing git:(main*) $ ./app.py 
Got stderr: Sep 07, 2022 3:09:16 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

So to ensure that tabula-py was using a version of tabula-java that was compiled with these dependencies, I cloned the tabula-java repo and built the .jar manually:

jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/github/tabula-java git:(master) $ tail -15 pom.xml 

    <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-jpeg2000</artifactId>
        <version>1.4.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.4</version>
    </dependency>
</dependencies>

</project>

jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/github/tabula-java git:(master) $ mvn clean compile assembly:single
[INFO] Scanning for projects...
[INFO] Inspecting build with total of 1 modules...
[INFO] Installing Nexus Staging features:
[INFO]   ... total of 1 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO] 
[INFO] ----------------------< technology.tabula:tabula >----------------------
[INFO] Building Tabula 1.0.6-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
..........
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.344 s
[INFO] Finished at: 2022-09-07T15:22:10-05:00
[INFO] ------------------------------------------------------------------------

Then I copied this .jar into the tabula-py directory under site-packages, under the file name that tabula-py expects:

jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/.pyenv/versions/3.9.10/envs/alert_logic/lib/python3.9/site-packages $ cp ~/github/tabula-java/target/tabula-1.0.6-SNAPSHOT-jar-with-dependencies.jar tabula/tabula-1.0.5-jar-with-dependencies.jar

jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/.pyenv/versions/alert_logic/lib/python3.9/site-packages/tabula $ grep -A1 "TABULA_JAVA_VERSION" io.py 
TABULA_JAVA_VERSION = "1.0.5"
JAR_NAME = "tabula-{}-jar-with-dependencies.jar".format(TABULA_JAVA_VERSION)
JAR_DIR = os.path.abspath(os.path.dirname(__file__))
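
One way to double-check that the JAR at this path really bundles the image decoders is to inspect it directly. The sketch below is not part of tabula-py or tabula-java. It assumes the two Maven artifacts above ship their classes under the package prefixes listed in `DECODER_PREFIXES` (adjust them if that assumption is wrong), and it also lists what the JAR registers in `META-INF/services/javax.imageio.spi.ImageReaderSpi`, the service file Java's ImageIO uses to discover readers. Whether a missing registration is the cause of the error here is not established in this thread.

```python
import zipfile

# Assumed package prefixes for the two artifacts added to pom.xml.
# These are illustrative guesses, not values taken from tabula-java.
DECODER_PREFIXES = {
    "jai-imageio-jpeg2000": "com/github/jaiimageio/jpeg2000/",
    "jbig2-imageio": "org/apache/pdfbox/jbig2/",
}

# Service file through which ImageIO discovers ImageReader plugins.
SPI_ENTRY = "META-INF/services/javax.imageio.spi.ImageReaderSpi"


def find_decoders(jar_path):
    """Return {artifact: bool}: does the JAR contain classes under each prefix?"""
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()
    return {
        artifact: any(n.startswith(prefix) and n.endswith(".class") for n in names)
        for artifact, prefix in DECODER_PREFIXES.items()
    }


def registered_readers(jar_path):
    """Return the reader SPI class names listed in the JAR's service file."""
    with zipfile.ZipFile(jar_path) as jar:
        if SPI_ENTRY not in jar.namelist():
            return []
        text = jar.read(SPI_ENTRY).decode("utf-8")
    stripped = (line.strip() for line in text.splitlines())
    # Skip blank lines and '#' comments, which the service-file format allows.
    return [line for line in stripped if line and not line.startswith("#")]
```

For example, calling `find_decoders("tabula-1.0.5-jar-with-dependencies.jar")` and `registered_readers(...)` from the `tabula` package directory would show whether the copied JAR contains the decoder classes and whether they are registered as ImageIO readers.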

(3.9.10/envs/alert_logic) jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/github/Alert-Logic-Billing git:(main*) $ ./app.py 
Got stderr: Sep 07, 2022 3:32:05 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Sep 07, 2022 3:32:05 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Sep 07, 2022 3:32:05 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException

(3.9.10/envs/alert_logic) jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/github/Alert-Logic-Billing git:(main*) $ python3
Python 3.9.10 (main, Feb 15 2022, 11:15:40) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tabula
>>> tabula.environment_info()
Python version:
    3.9.10 (main, Feb 15 2022, 11:15:40) 
[Clang 12.0.0 (clang-1200.0.32.29)]
Java version:
    java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
tabula-py version: 2.5.1
platform: macOS-10.16-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='RXT-Macbook-Pro-2.local', release='21.6.0', version='Darwin Kernel Version 21.6.0: Sat Jun 18 17:07:25 PDT 2022; root:xnu-8020.140.41~1/RELEASE_X86_64', machine='x86_64')
linux_distribution: ('Darwin', '21.6.0', '')
mac_ver: ('10.16', ('', '', ''), 'x86_64')

I also tried manually downloading the 1.0.5 release JAR from GitHub, but got the same results:

jona5523@RXT-Macbook-Pro-2.local:/Users/jona5523/.pyenv/versions/alert_logic/lib/python3.9/site-packages/tabula $ wget https://github.com/tabulapdf/tabula-java/releases/download/v1.0.5/tabula-1.0.5-jar-with-dependencies.jar -qO tabula-1.0.5-jar-with-dependencies.jar

I am unsure why I am continuing to get this error even though I ensured that tabula-py is using a tabula-java JAR that was compiled with the proper dependencies. Appreciate any help you can provide!

sonu-gupta commented 1 year ago

In the same boat. Were you able to resolve it?

jazzido commented 1 year ago

Hi @sonu-gupta. Do you get an exception, or just a warning? Tables contained in images are not read by Tabula anyway, so an exception of that kind can be safely ignored.
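
To tell the two cases apart, the captured stderr can be filtered for anything other than the JPEG2000 log message quoted in this issue. This is a minimal sketch, assuming the log format shown above (a `java.util.logging` timestamp line followed by a `SEVERE:` line). The function name `unexpected_stderr_lines` is hypothetical and not part of tabula-py.

```python
import re

# The PDFBox log message quoted in this issue.
IMAGE_DECODE_MSG = re.compile(r"^SEVERE: Cannot read JPEG2000 image:")
# Timestamp header line, e.g. "Sep 07, 2022 3:09:16 PM org.apache... operatorException".
LOG_HEADER = re.compile(r"^[A-Z][a-z]{2} \d{2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M ")
# Prefix that appears in front of the Java output in the transcripts above.
STDERR_PREFIX = "Got stderr: "


def unexpected_stderr_lines(stderr_text):
    """Return stderr lines that are neither log headers nor the JPEG2000 message."""
    remaining = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith(STDERR_PREFIX):
            line = line[len(STDERR_PREFIX):]
        if not line or LOG_HEADER.match(line) or IMAGE_DECODE_MSG.match(line):
            continue
        remaining.append(line)
    return remaining
```

An empty result means only the image-decoding message was logged. Anything left over, such as a Java stack trace, is worth reporting separately. tabula-py also documents a `silent` option for `read_pdf` that suppresses stderr output, which may be enough if the installed version supports it and the messages are only noise.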