Closed michelcrypt4d4mus closed 6 months ago
When I downgrade to 3.14.0 this issue goes away so I think it can be confirmed as a regression. Here's a file that was failing in 3.16.4 but working fine in 3.14.0 (also usable for tests): FTX Claim Skybridge Capital 30062023113350File971325116.pdf
images from FTX Claim SC30 01072023101624File595287144.pdf : iss2266a_images.zip
images from FTX.Claim.Skybridge.Capital.30062023113350File971325116.pdf : iss2266b_images.zip
@michelcrypt4d4mus Can you please indicate the exact images that used to fail? Checking every image is too time-consuming.
I'm not 100% sure this is a PyPDF issue, though I suspect it is a regression introduced since 3.14.0, because this never used to happen in my application and now it happens quite frequently despite both the calling code and the PyTesseract package being unchanged. There's at least a small chance there's some issue in the underlying Tesseract binary.
Environment
Code + PDF
The code is here, in particular these lines, where a PIL.Image object is extracted from the PDF:
produce a PIL.Image object that is passed to PyTesseract here:
PyTesseract then fails with this:
PDF file
You can use the PDF file in tests. FTX Claim SC30 01072023101624File595287144.pdf
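To narrow down which embedded images trigger the failure, something like the following might help. It is only a minimal sketch, not the code from my application: it assumes pypdf's `PdfReader` and `page.images` (each item exposing `.name` and a decoded `.image`) and `pytesseract.image_to_string`, and the helper names `iter_page_images`, `find_failing_images` and `report_failures` are made up for illustration.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def iter_page_images(pdf_path: str) -> Iterator[Tuple[int, str, Any]]:
    """Yield (page_number, image_name, PIL.Image) for every embedded image."""
    # Imported lazily so find_failing_images() stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        # Note: an error raised while pypdf decodes an image surfaces here,
        # not in the OCR step below.
        for image_file in page.images:
            yield page_number, image_file.name, image_file.image


def find_failing_images(
    images: Iterable[Tuple[int, str, Any]],
    ocr: Callable[[Any], Any],
) -> List[Tuple[int, str, str]]:
    """Run `ocr` on each image and collect the ones that raise."""
    failures = []
    for page_number, name, image in images:
        try:
            ocr(image)
        except Exception as exc:  # broad on purpose: we only want to locate failures
            failures.append((page_number, name, repr(exc)))
    return failures


def report_failures(pdf_path: str) -> None:
    """Print every image in `pdf_path` that PyTesseract fails on."""
    import pytesseract

    failures = find_failing_images(
        iter_page_images(pdf_path), pytesseract.image_to_string
    )
    for page_number, name, error in failures:
        print(f"page {page_number}: {name}: {error}")
```

Running `report_failures(...)` on the same PDF under 3.14.0 and 3.16.4 and comparing the output should show whether the set of failing images changes between the two versions.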
Traceback