tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Getting "subprocess" error for the same PDF files which are working fine with Tabula in local machine. #540

Open deepakdhiman7 opened 7 months ago

deepakdhiman7 commented 7 months ago

We are getting the "subprocess" error below when running our code in a container. On a local machine, however, it works fine. We installed Tabula on the local machine a year ago. Even in the container, it was working fine until this week. I am attaching the PDFs for which it fails; package versions are listed below. Could the PDF files be the cause, even though they run fine with the same version on the local machine? Or is it the environment? We checked, and there have been no changes to environment permissions etc.

PDFs: IONIS Registartion document (002).pdf test_Vinayak.pdf [Uploading Annual_Report.pdf…]()

Package Versions:

(llms) dd00740409@ns3067540:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

(llms) dd00740409@ns3067540:~$ python
Python 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:06:00)
[GCC 11.4.0] on linux

Error: subprocess.CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', '/usr/local/lib/python3.8/site-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', '9', '--stream', '--guess', '--format', 'JSON', 'Roa8dvYUVmHQLKhhvTiPL.pdf']' returned non-zero exit status 1.
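The CalledProcessError above reports only the exit status; the Java-side cause is written to the child's stderr. A minimal sketch, using only the Python standard library, of running a command and keeping its stderr for inspection. The helper names run_and_capture and report_failure are hypothetical and not part of tabula-py.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd; return (exit_status, stdout_text, stderr_text) without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def report_failure(cmd, stream=sys.stderr):
    """Print the child's stderr when it exits non-zero; return the exit status."""
    status, _, err = run_and_capture(cmd)
    if status != 0:
        print(f"command exited with status {status}:", file=stream)
        print(err, file=stream)
    return status
```

Passing the same argument list that appears in the error message to report_failure should print the Java stack trace rather than only the exit status.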

Logs:

Exception in thread "main" java.lang.UnsatisfiedLinkError: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjavajpeg.so: libjpeg.so.8: cannot open shared object file: No such file or directory
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1838)
    at java.lang.Runtime.loadLibrary0(Runtime.java:843)
    at java.lang.System.loadLibrary(System.java:1136)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader$1.run(JPEGImageReader.java:92)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader$1.run(JPEGImageReader.java:90)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.<clinit>(JPEGImageReader.java:89)
    at com.sun.imageio.plugins.jpeg.JPEGImageReaderSpi.createReaderInstance(JPEGImageReaderSpi.java:85)
    at javax.imageio.spi.ImageReaderSpi.createReaderInstance(ImageReaderSpi.java:320)
    at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:529)
    at javax.imageio.ImageIO$ImageReaderIterator.next(ImageIO.java:513)
    at org.apache.pdfbox.filter.Filter.findImageReader(Filter.java:155)
    at org.apache.pdfbox.filter.DCTFilter.decode(DCTFilter.java:58)
    at org.apache.pdfbox.cos.COSInputStream.create(COSInputStream.java:80)
    at org.apache.pdfbox.cos.COSStream.createInputStream(COSStream.java:175)
    at org.apache.pdfbox.pdmodel.common.PDStream.createInputStream(PDStream.java:243)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.createInputStream(PDImageXObject.java:791)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.from8bit(SampledImageReader.java:517)
    at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:226)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:481)
    at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:462)
    at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1110)
    at org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:67)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:277)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:347)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:268)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:254)
    at technology.tabula.Utils.pageConvertToImage(Utils.java:285)
    at technology.tabula.detectors.NurminenDetectionAlgorithm.detect(NurminenDetectionAlgorithm.java:101)
    at technology.tabula.CommandLineApp$TableExtractor.extractTablesBasic(CommandLineApp.java:421)
    at technology.tabula.CommandLineApp$TableExtractor.extractTables(CommandLineApp.java:408)
    at technology.tabula.CommandLineApp.extractFile(CommandLineApp.java:180)
    at technology.tabula.CommandLineApp.extractFileTables(CommandLineApp.java:124)
    at technology.tabula.CommandLineApp.extractTables(CommandLineApp.java:106)
    at technology.tabula.CommandLineApp.main(CommandLineApp.java:76)
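
The first line of the trace names a shared library the JVM could not open (libjpeg.so.8, requested by libjavajpeg.so), which suggests checking the container image as well as the PDFs. A rough diagnostic sketch, assuming the stderr text is available as a string: it pulls the missing library name out of the message and checks whether the current environment can load it. The function names are hypothetical, and the regular expression is only fitted to the message format shown in this log.

```python
import ctypes
import re

# Matches the dynamic-loader message embedded in a Java UnsatisfiedLinkError,
# in the form seen in the log above:
#   UnsatisfiedLinkError: <requesting .so>: <missing .so>: cannot open shared object file
_MISSING_LIB = re.compile(
    r"UnsatisfiedLinkError: (?P<requester>\S+): (?P<missing>\S+): "
    r"cannot open shared object file"
)


def missing_shared_library(stderr_text):
    """Return the name of the library that failed to load, or None if no match."""
    match = _MISSING_LIB.search(stderr_text)
    return match.group("missing") if match else None


def can_load(library_name):
    """Return True if the dynamic loader in this environment can open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True
```

Running can_load("libjpeg.so.8") both on the local machine and inside the container would show whether the two environments differ on this library.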