oschwartz10612 / poppler-windows

Download Poppler binaries packaged for Windows with dependencies
MIT License

Empty file while creating Tiff-files from PDF #5

Closed p2k-ko closed 3 years ago

p2k-ko commented 4 years ago

Hi, while testing the Python module pdf2image I noticed an error when creating TIFF files from a PDF document. pdf2image uses pdftocairo for the TIFF format:

https://github.com/Belval/pdf2image/issues/155

I used the 0.90.1 release on Windows 10 and tried to execute pdftocairo directly:

D:\pdf2image_test
(venv) λ pdftocairo.exe -v
pdftocairo version 0.90.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

D:\pdf2image_test
(venv) λ pdftocairo.exe -tiff example.pdf

D:\pdf2image_test
(venv) λ dir example*
 Volume in drive D is Daten

 Directory of D:\pdf2image_test

28.08.2020  11:39                 0 example-1.tif
01.07.2020  11:36           761.716 example.pdf
               2 File(s),        761.716 bytes
               0 Dir(s), 16.826.216.448 bytes free

I tried the same example.pdf on Debian 10 Linux (with poppler-utils 0.71):

➜  ~ pdftocairo -tiff example.pdf
➜  ~ ls example* -1              
example-1.tif
example-2.tif
example-3.tif
example-4.tif
example-5.tif
example-6.tif
example-7.tif
example-8.tif
example.pdf
➜  ~    

On Debian 10 it works as expected. Am I missing any dependencies for pdftocairo on Windows 10?
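
Because pdftocairo exits silently here, the symptom is easy to miss in scripts. A minimal sketch, using only the Python standard library, of how the zero-byte outputs shown above could be detected after a conversion run. The function name `find_empty_outputs` is made up for this illustration and is not part of Poppler or pdf2image.

```python
from pathlib import Path


def find_empty_outputs(directory, pattern="*.tif"):
    """Return the sorted names of zero-byte files in `directory` matching `pattern`.

    Hypothetical helper for illustration only.
    """
    return sorted(
        p.name
        for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )
```

Called on the test folder above with the pattern "example*.tif", this would be expected to report example-1.tif, since that file has a size of 0 bytes.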

Best regards Stephan

oschwartz10612 commented 4 years ago

Hi Stephan,

I replicated your issue and also got an empty file. Because it ran without any errors, I believe it is not a missing dependency.

Unfortunately I am not very familiar with the poppler library itself. This repository was thrown together to package it from conda-forge in a zip for Belval's project and ease of use.

I am sorry to send you further down the rabbit hole, but I would ask the guys over at poppler-feedstock, as this would likely need to be fixed in the conda package before I can update it here.

I apologize that I could not be of more help!

Owen

Belval commented 4 years ago

From looking at the recipe it seems like libtiff is included: https://github.com/conda-forge/poppler-feedstock/blob/f98dc28d3138c459ca8239811f794eaa749af79b/.ci_support/win_.yaml#L22

@p2k-ko if you don't have the time I will probably contact the feedstock maintainers because someone else is having the same issue. Would you be kind enough to see if you can reproduce the issue with this build: https://blog.alivate.com.au/poppler-windows/ ?

p2k-ko commented 4 years ago

@Belval I tried the mentioned build. The issue also occurs with this version:

λ pdftocairo.exe -v
pdftocairo version 0.68.0
Copyright 2005-2018 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

λ pdftocairo.exe -tiff example.pdf
-: Error writing TIFF header.
Error writing example-1.tif

The error message "Error writing TIFF header" did not appear with Poppler 0.90. I found the error message in libtiff:

Belval commented 4 years ago

Then the error is probably not related to feedstock. Do we have anyone who ever successfully converted to TIFF on a Windows machine?

oschwartz10612 commented 4 years ago

@Elephant940 confirmed that the Windows version was never tested after my freetype and cairo patches, and I confess I did not either after it built.

Stupesmith commented 4 years ago

Hello everyone,

If I can add some information here: I have exactly the same problem as p2k-ko. I work with PDF files full of JBIG2-encoded images.

This is the result I get when I try to extract the images:

D:\working\extract_tiff>pdfimages -v
pdfimages version 0.90.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

D:\working\extract_tiff>pdfimages -jbig2 my_pdf.pdf .\extract\

D:\working\extract_tiff>dir .\extract

 Directory of D:\working\extract_tiff\extract

31/08/2020  16:47    <DIR>          .
31/08/2020  16:47    <DIR>          ..
31/08/2020  16:51            33 831 -000.jb2e
               1 File(s)           33 831 bytes
               2 Dir(s)  49 204 383 744 bytes free

Here I should also have a .jb2g file, which contains the header needed to build the image. That might explain p2k-ko's error: "Error writing TIFF header"

Here is another test, directly in Python, using the convert_from_path function of pdf2image:

test = convert_from_path(os.path.join(dir, file), fmt="tiff")
Traceback (most recent call last):
  File "C:\Python\venv\Projet_IA\lib\site-packages\IPython\core\interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-9e55175c5bca>", line 1, in <module>
    test = convert_from_path(os.path.join(dir, file), fmt="tiff")
  File "C:\Python\venv\Projet_IA\lib\site-packages\pdf2image\pdf2image.py", line 206, in convert_from_path
    images += _load_from_output_folder(
  File "C:\Python\venv\Projet_IA\lib\site-packages\pdf2image\pdf2image.py", line 499, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "C:\Python\venv\Projet_IA\lib\site-packages\PIL\Image.py", line 2930, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '...\\AppData\\Local\\Temp\\tmphatmweuv\\da729fb9-2968-4fd2-8e72-3574d3bbacf4-1.tif'

The .tif is created in the temp folder but, as in the original report, it is empty.
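
A valid TIFF file starts with a byte-order mark ("II" or "MM") followed by the version number 42 (or 43 for BigTIFF), so an empty file can never be identified as an image. A minimal sketch of a pre-check that could be run before handing a file to PIL. The name `looks_like_tiff` is hypothetical and the check only inspects the first four bytes, so it says nothing about whether the rest of the file is intact.

```python
# First four bytes of the TIFF variants recognised by this sketch.
TIFF_MAGIC = (
    b"II*\x00",  # little-endian classic TIFF (version 42)
    b"MM\x00*",  # big-endian classic TIFF (version 42)
    b"II+\x00",  # little-endian BigTIFF (version 43)
    b"MM\x00+",  # big-endian BigTIFF (version 43)
)


def looks_like_tiff(path):
    """Return True if the file at `path` begins with a TIFF signature.

    Hypothetical helper: an empty or truncated file returns False.
    """
    with open(path, "rb") as fh:
        return fh.read(4) in TIFF_MAGIC
```

For the empty .tif in the temp folder this check returns False, which matches the UnidentifiedImageError raised above.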

Not sure this post will help, but here it is. I will also try with the 0.68 build.

Edit: Same results as p2k-ko:

D:\working\extract_tiff>pdfimages -tiff my_pdf.pdf .\extract\
-: Error writing TIFF header.
I/O Error: Error writing '.\extract\-000.tif'

oschwartz10612 commented 3 years ago

Have we made any progress on this issue?

fawazahmed0 commented 3 years ago

I have run into this issue as well.

fawazahmed0 commented 3 years ago

OK, this is how I got TIFF working: Link to StackOverflow

fawazahmed0 commented 3 years ago

If someone wants to look into this issue, you may want to see how the build is done on Windows here and here, and maybe replicate the same build steps in GitHub Actions.

Stupesmith commented 3 years ago

OK, this is how I got TIFF working: Link to StackOverflow

Thank you so much for this solution. It's very nice and easy to use.

Another way to build poppler and be able to use pdftocairo to extract TIFF from PDF is to use WSL. I did that successfully. But your solution is much easier and "callable" from Python.
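
The linked answer is not reproduced in this thread, so the following is a separate illustration rather than that solution. It is a sketch of one Python-callable route that avoids Poppler's TIFF writer: render the pages with pdf2image in its default format and let Pillow write a multi-page TIFF. It assumes pdf2image and Pillow are installed and that Poppler is on the PATH. To my understanding the default format goes through pdftoppm rather than pdftocairo, but that is an assumption worth verifying for your pdf2image version. The helper names are hypothetical.

```python
from pathlib import Path


def multipage_tiff_path(pdf_path):
    """Hypothetical helper: put the TIFF next to the PDF, with the same stem."""
    return Path(pdf_path).with_suffix(".tif")


def pdf_to_multipage_tiff(pdf_path, dpi=200):
    """Render every page of a PDF and save them as one multi-page TIFF."""
    # Imported here so the path helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    # Default output format, so Poppler's TIFF writer is not involved (assumption).
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    if not pages:
        raise ValueError(f"no pages rendered from {pdf_path}")

    out = multipage_tiff_path(pdf_path)
    # Pillow writes the remaining pages as additional frames of the same file.
    pages[0].save(out, format="TIFF", save_all=True, append_images=pages[1:])
    return out
```

All pages are held in memory at once, so for very large documents it may be better to process the PDF in page ranges.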

oschwartz10612 commented 3 years ago

That is a great workaround!

I just built poppler-20.09.0 to see if this fixes the issue.

fawazahmed0 commented 3 years ago

I just built poppler-20.09.0 to see if this fixes the issue.

Nope, doesn't seem to work

oschwartz10612 commented 3 years ago

Okay. I will take this up with the poppler-feedstock guys shortly.

yogi2806 commented 3 years ago

Any ETA on this issue?

I have raised the same issue on Poppler's issue tracker, in case you could follow up there or get some sort of solution from them: https://gitlab.freedesktop.org/poppler/poppler/-/issues/985

fawazahmed0 commented 3 years ago

A similar issue was already raised here. You can use the msys2 package, which doesn't have this problem. Here are the steps you can follow.

yogi2806 commented 3 years ago

A similar issue was already raised here. You can use the msys2 package, which doesn't have this problem. Here are the steps you can follow.

Sure thanks, let me try that

oschwartz10612 commented 3 years ago

I apologize for being slow, I have been quite busy.

After building libtiff from the latest source I could find and trying to use it, I got the same result.

I also installed msys2 and used their libtiff DLLs, and it throws the following error: [screenshot: Annotation 2020-11-10 163847]

I will reach out to the poppler feedstock guys to get their take tonight.

oschwartz10612 commented 3 years ago

Peter Williams at poppler-feedstock has also identified this as an issue with the libtiff-feedstock and has opened an issue on their repo.

oschwartz10612 commented 3 years ago

This should be fixed in the latest release: https://github.com/oschwartz10612/poppler-windows/releases/tag/v21.01.0

Please let me know if there are any further issues!

frederick0291 commented 3 years ago

Hello,

I am currently having issues with this on Poppler. I am trying to convert a 200-page PDF to TIFF, but Poppler only converts the first page.

I tried running pdftocairo directly, and it does indeed convert only one page.

Converting the PDF file into other image types did not cause any issues, and all the pages were converted.

Can anybody take a look at this?
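
One way to narrow this down could be to invoke pdftocairo once per page with its -f (first page) and -l (last page) options and see which pages fail. This is a diagnostic sketch, not a confirmed fix. The function names are hypothetical, and the page count is assumed to be known already, for example from pdfinfo.

```python
import subprocess


def per_page_commands(pdf_path, out_prefix, n_pages, exe="pdftocairo"):
    """Build one pdftocairo command line per page (pages are numbered from 1)."""
    return [
        [exe, "-tiff", "-f", str(page), "-l", str(page), pdf_path, out_prefix]
        for page in range(1, n_pages + 1)
    ]


def run_per_page(pdf_path, out_prefix, n_pages):
    """Run each command and return the page numbers whose conversion failed."""
    failed = []
    commands = per_page_commands(pdf_path, out_prefix, n_pages)
    for page, cmd in enumerate(commands, start=1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(page)
    return failed
```

As seen earlier in this thread, pdftocairo can exit without an error and still write an empty file, so the output files should be checked as well as the return codes.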