paulocoutinhox / pdfium-lib

PDFium - Project to compile PDFium library to multiple platforms.
https://pdfviewer.github.io/
MIT License

WASM build does not render images #26

Closed CetinSert closed 3 years ago

CetinSert commented 3 years ago

Describe the bug: pdfium.wasm fails to render some images.

To Reproduce: https://d15k2d11r6t6rl.cloudfront.net/public/users/Integrators/a0a42ab5-3cb9-4912-84b7-3c6e47330d5c/smart-pr-805/Consumentenfolder_A5_zonder_paskruizen%20_2.pdf

  1. Open the PDF in https://pdfviewer.github.io/
  2. See the error

Expected behavior: image

Screenshots: image

CetinSert commented 3 years ago

Another one where it is even more noticeable: https://github.com/mozilla/pdf.js/files/5722039/pdfjsError.pdf

expected

image

vs pdfium.wasm

image

paulocoutinhox commented 3 years ago

Hi,

I'm trying to solve it too.

I will update the library to check whether that fixes it.

I opened a discussion about it here: https://groups.google.com/g/pdfium/c/xqMSoBa6ZVU

Thanks.

inzanez commented 3 years ago

Hey there, still working on this?

paulocoutinhox commented 3 years ago

Hi,

Yes, but without a solution yet :(

inzanez commented 3 years ago

Well, I would love to join, but currently that's not possible as I don't have enough time. I hope and think that this is going to change towards Q3/Q4 this year.

In the meantime, I thought that this might help: https://github.com/coolwanglu/PDFium.js/

That's another project (quite old though) which ported PDFium to WASM, and it seems that the troubling images are working there. So maybe you can spot something that gives you a hint about what might be missing...

P.S. Building his project with the current sources will not work.

paulocoutinhox commented 3 years ago

Nice, I'll check it.

paulocoutinhox commented 3 years ago

Hi,

I read through that repository and compared it with mine. It uses very old source code, so it is hard to see what has changed, and the owner hasn't answered my tweet asking to get in touch.

I have updated the "pdfium-lib" repository. It now has the latest version of PDFium with all patches updated, and the template now includes the PDFium branch and commit.

Check here: https://pdfviewer.github.io/

But it still has the image problem.

Any help is welcome!

CetinSert commented 3 years ago

@paulo-coutinho does this affect only the WASM build?

paulocoutinhox commented 3 years ago

Hi,

Apparently yes. I ran the pdfium_test executable, as the Google engineer suggested, and it worked as expected, as you can see here: https://groups.google.com/g/pdfium/c/xqMSoBa6ZVU

I just need to understand why it happens, and then I will create the patches.

If anyone can help, we can solve this.

Thanks.

inzanez commented 3 years ago

I'll give it a try tonight...

inzanez commented 3 years ago

I'm currently testing around with your pre-built binaries, as I cannot seem to make things work compiling from scratch: .../test/pdfium-lib/build/linux/x64/release/lib/libpdfium.a: Unknown format, not a static library!

That basically happens when I try to build the test and generate the final WASM (steps 8 and 9 in the WASM compile guide). Any idea why that might be the case?
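
A quick way to narrow down an error like this is to look at the first bytes of the file the tool rejects. The sketch below is a minimal Python check and not part of pdfium-lib: `classify_magic` and `classify_archive` are hypothetical helper names, and nothing in this thread establishes what was actually wrong with `libpdfium.a`. It relies only on the standard `ar` magic strings: a regular static archive starts with `!<arch>\n`, and a GNU thin archive starts with `!<thin>\n`.

```python
# Magic strings used by the Unix `ar` archive format.
AR_MAGIC = b"!<arch>\n"    # regular static archive
THIN_MAGIC = b"!<thin>\n"  # GNU thin archive (references object files by path)


def classify_magic(first_bytes):
    """Classify a file by its first 8 bytes (hypothetical helper)."""
    if first_bytes.startswith(AR_MAGIC):
        return "archive"
    if first_bytes.startswith(THIN_MAGIC):
        return "thin-archive"
    return "not-an-archive"


def classify_archive(path):
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as handle:
        return classify_magic(handle.read(8))
```

Running `classify_archive` on the `libpdfium.a` named in the error message shows which of the three cases applies. Anything other than `"archive"` would suggest looking at how the library was produced rather than at the final link step.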

paulocoutinhox commented 3 years ago

Which branch are you testing?

Use 4466 branch.

inzanez commented 3 years ago

Currently the master branch. I will switch then... Can you explain why you are referring to build options like 'USE_JPEG=1', whereas I cannot seem to find these emsdk options in the repo I pointed out above?

paulocoutinhox commented 3 years ago

Hi,

The flag USE_JPEG=1 means using the emsdk JPEG port library instead of the library bundled with PDFium.

If you disable it, you will get this error:

emcc -MMD -MF obj/third_party/nasm/nasm/realpath.o.d -DUSE_UDEV -DUSE_AURA=1 -DUSE_GLIB=1 -DUSE_NSS_CERTS=1 -DUSE_OZONE=1 -DUSE_X11=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_GNU_SOURCE -DCR_CLANG_REVISION=\"llvmorg-13-init-4720-g7bafe336-1\" -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -DHAVE_CONFIG_H -I../.. -Igen -I../../third_party/nasm -I../../third_party/nasm/asm -I../../third_party/nasm/disasm -I../../third_party/nasm/include -I../../third_party/nasm/output -I../../third_party/nasm/x86 -fno-delete-null-pointer-checks -fno-ident -fno-strict-aliasing --param=ssp-buffer-size=4 -fno-stack-protector -funwind-tables -fPIC -fcolor-diagnostics -fmerge-all-constants -fcrash-diagnostics-dir=../../tools/clang/crashreports -mllvm -instcombine-lower-dbg-declare=0 -Wno-builtin-macro-redefined -D__DATE__= -D__TIME__= -D__TIMESTAMP__= -Xclang -fdebug-compilation-dir -Xclang . -no-canonical-prefixes -O2 -fdata-sections -ffunction-sections -fno-omit-frame-pointer -g0 -ftrivial-auto-var-init=pattern -fvisibility=hidden -Wheader-hygiene -Wstring-conversion -Wtautological-overlap-compare -Werror -Wall -Wno-unused-variable -Wno-misleading-indentation -Wno-missing-field-initializers -Wno-unused-parameter -Wno-c++11-narrowing -Wno-unneeded-internal-declaration -Wno-undefined-var-template -Wno-psabi -Wno-deprecated-register -Wno-implicit-int-float-conversion -Wno-final-dtor-non-final-class -Wno-builtin-assume-aligned-alignment -Wno-deprecated-copy -Wno-non-c-typedef-for-linkage -Wmax-tokens -Wno-unused-function -Wno-string-conversion -Wno-macro-redefined -Wno-sign-compare -Wno-nonnull -Wno-uninitialized -std=c11 -Wno-implicit-fallthrough -c ../../third_party/nasm/nasmlib/realpath.c -o obj/third_party/nasm/nasm/realpath.o
../../third_party/nasm/nasmlib/realpath.c:58:16: error: implicit declaration of function 'canonicalize_file_name' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
    char *rp = canonicalize_file_name(rel_path);
               ^
../../third_party/nasm/nasmlib/realpath.c:58:11: error: incompatible integer to pointer conversion initializing 'char *' with an expression of type 'int' [-Werror,-Wint-conversion]
    char *rp = canonicalize_file_name(rel_path);
          ^    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.

Thanks.

inzanez commented 3 years ago

Yes, I know, I just wondered how the repo above manages to compile without that flag.

Well, never mind, I will need some time to investigate, but it seems that handling embedded images in general poses a problem with the current build.

CetinSert commented 3 years ago

@inzanez hello, have you been able to make any progress on this?

inzanez commented 3 years ago

@cetinsert I'm afraid not. I managed to build everything and tried different build options, but it won't change anything so far. I guess there's no way around digging deeper in the PDFium codebase. I found something that might help in finding the culprit: the WASM build is not handling images at all. It might be that this is somehow related to binding to libJpeg etc. in WASM. I managed to crash the online PDF Viewer and that might lead to the issue:

  1. Convert any PNG / JPEG etc. into a pdf (you could use pdfcpu: pdfcpu import Test.pdf my_image.png)
  2. Render that image using PDFium in WASM mode, it will crash

I will definitely continue looking into this once I have more time, but that will take until around August I'm afraid.
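
To check whether a given PDF actually carries JPEG-encoded images (the kind suspected above), a rough byte scan for stream filter names can help. This is a minimal sketch, assuming the filter names appear uncompressed in the file; PDFs that keep their object dictionaries inside compressed object streams will be undercounted. `count_image_filters` is a hypothetical helper, not a PDFium or pdfcpu API.

```python
import re

# Image-related stream filter names from the PDF specification.
IMAGE_FILTERS = {
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"CCITTFaxDecode": "CCITT fax",
    b"JBIG2Decode": "JBIG2",
}


def count_image_filters(pdf_bytes):
    """Count occurrences of image filter names in raw PDF bytes."""
    counts = {}
    for name, label in IMAGE_FILTERS.items():
        # A PDF name starts with "/" and ends at a delimiter or whitespace.
        pattern = rb"/" + re.escape(name) + rb"\b"
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[label] = hits
    return counts
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `count_image_filters` gives a quick hint about whether a test PDF exercises the JPEG decoder at all.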

paulocoutinhox commented 3 years ago

Hi,

Can you attach the PDF that crashes here?

Maybe it will help with testing and/or finding a solution.

Thanks.

inzanez commented 3 years ago

@paulo-coutinho After some more testing it seems that it's very browser-specific, so it doesn't seem to be the wasm that fails, but the browser with the empty output of the wasm module. I can confirm that I don't get any errors from the wasm module.

paulocoutinhox commented 3 years ago

Hi,

But do you know how to solve it?

I tried a lot of things, but without success.

Thanks.

CetinSert commented 3 years ago

@paulo-coutinho Are codecs for embedded images not part of PDFium proper? Do these get built from other projects and linked to PDFium?

paulocoutinhox commented 3 years ago

You can see it more clearly here: https://github.com/lukas-w/pdfium

It uses "libjpeg_turbo": https://github.com/lukas-w/pdfium/blob/f8d930b68fdd3e9d20434c5a1788205e9ba0e695/DEPS#L88

and here: https://github.com/lukas-w/pdfium/search?q=turbo

inzanez commented 3 years ago

@paulo-coutinho If I turn on debugging for WASM in Chromium and instruct it to pause on exceptions, I register several 'longjmp' exceptions after FPDF_RenderPageBitmap is called. I just tried to rebuild the project so that I could include debug information (https://emscripten.org/docs/porting/Debugging.html), but it seems that I cannot build the project anymore. I don't really know where and why it fails, I just get a lot of error messages. Maybe you can put a build online with debug information included? I hope that we can find the function failing that way...

paulocoutinhox commented 3 years ago

Hi,

Sure, I will do it.

Thanks.

paulocoutinhox commented 3 years ago

Hi,

It is all updated, and you can download the WASM debug and release builds here: https://github.com/paulo-coutinho/pdfium-lib/releases/tag/4505

Note: only the WASM target has a debug build, for your tests.

It is also published here: https://pdfviewer.github.io/

Thanks for any help to solve it.

inzanez commented 3 years ago

@paulo-coutinho Many thanks for that build, that's quite interesting. Apologies for the delay, I still have too much work to attend to. I ran the debug build in the browser today, and as I thought, it seems to be related to the JPEG decoding:

image

I haven't had the time to dig into that yet but wanted to share this, maybe it helps someone else in the meantime. I will still continue to work on this whenever I have time.

inzanez commented 3 years ago

@paulo-coutinho Dear Paulo, I finally managed it. Please don't ask me why it is that way, as I cannot answer that. I just followed a feeling I had... I can't put the whole picture together, but I do have a running version that seems to render the 'faulty' PDFs just fine, without producing errors.

I checked all the distros I used to build the thing so far, and it seems all of them are using libjpeg-turbo as a default. So I created a simple docker container based on debian:latest running docker run -it debian:latest /bin/bash and installed all the things required to build PDFium. Then, I downloaded libjpeg from here: libjpeg, built it with --prefix=/usr and installed it. I then had to copy some include files from /usr/include to /usr/include/x86_... so that the patch script worked, built the thing, and it worked.

I just cross-checked with another build with the same emsdk version and your latest repo but with libjpeg-turbo as the system jpeg-library, and I can confirm it fails again.

So it seems that you should not build it on a system that uses libjpeg-turbo as the default JPEG library. I would have thought that, since the two are interchangeable and emscripten brings its own JPEG library, this would not matter at all, so I am still a bit confused.

Maybe you can try and verify, on the other hand I could put together a Dockerfile for the build if you want me to.
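
One way to see which JPEG library a build host exposes is to inspect the headers the compiler will pick up. This is a minimal sketch, assuming that libjpeg-turbo's generated `jconfig.h` defines `LIBJPEG_TURBO_VERSION` and that the IJG `jpeglib.h` defines `JPEG_LIB_VERSION` without it. `jpeg_flavor` is a hypothetical helper, and the header locations vary by distribution, so the function takes the header text rather than a path.

```python
import re


def jpeg_flavor(header_text):
    """Classify combined jpeglib.h/jconfig.h text (hypothetical helper).

    Returns 'libjpeg-turbo', 'ijg-libjpeg', or 'unknown'.
    """
    # libjpeg-turbo announces itself with its own version macro.
    turbo = re.search(
        r"^\s*#\s*define\s+LIBJPEG_TURBO_VERSION\b", header_text, re.MULTILINE
    )
    if turbo:
        return "libjpeg-turbo"
    # Both libraries define JPEG_LIB_VERSION; without the turbo macro,
    # assume the reference IJG implementation.
    ijg = re.search(
        r"^\s*#\s*define\s+JPEG_LIB_VERSION\b", header_text, re.MULTILINE
    )
    if ijg:
        return "ijg-libjpeg"
    return "unknown"
```

Passing the concatenated contents of the installed `jpeglib.h` and `jconfig.h` to `jpeg_flavor` before building gives an early warning when the host defaults to libjpeg-turbo.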

paulocoutinhox commented 3 years ago

Hi,

Sure, I'm trying to fix it with your tips. I'm building a Docker image for WASM with everything included and will test it before uploading it here.

Thanks.

paulocoutinhox commented 3 years ago

I tried your steps and I get this error message when I try to compile:

COMMAND:

root@1cff8106b674:/app/build/linux/x64/gen/utils# em++ -g -o /app/build/linux/x64/gen/out/pdfium.html -s EXPORTED_FUNCTIONS="$(node function-names ../xml/index.xml)" -s EXPORTED_RUNTIME_METHODS='["ccall", "cwrap"]' custom.cpp @pdfium.rsp -std=c++11 -Wall --no-entry

ERROR

em++: error: '/emsdk/upstream/bin/wasm-emscripten-finalize --detect-features --minimize-wasm-changes -g --dyncalls-i64 --dwarf /app/build/linux/x64/gen/out/pdfium.wasm -o /app/build/linux/x64/gen/out/pdfium.wasm' failed (-9)

My new Dockerfile:

FROM ubuntu:18.04

# general
ARG DEBIAN_FRONTEND=noninteractive

ENV PROJ_TARGET="wasm"
ENV JAVA_VERSION="8"
ENV JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/"

# packages
RUN apt-get -y update
RUN apt-get install -y build-essential sudo file git wget curl cmake ninja-build zip unzip tar python3 python3-pip openjdk-${JAVA_VERSION}-jdk nano lsb-release libglib2.0-dev tzdata doxygen --no-install-recommends && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get clean

# define timezone
RUN echo "America/Sao_Paulo" > /etc/timezone
RUN dpkg-reconfigure -f noninteractive tzdata
RUN /bin/echo -e "LANG=\"en_US.UTF-8\"" > /etc/default/locale

# java
ENV PATH=${PATH}:${JAVA_HOME}/bin
RUN echo ${JAVA_HOME}
RUN java -version

# google depot tools
RUN git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git /opt/depot-tools
ENV PATH=${PATH}:/opt/depot-tools

# pdfium - dependencies
RUN mkdir /build
WORKDIR /build
RUN gclient config --unmanaged https://pdfium.googlesource.com/pdfium.git
RUN gclient sync
WORKDIR /build/pdfium
RUN git checkout 72fd656fee19235d9445796edee1e2c0c1e5e395

RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip

RUN apt-get install -o APT::Immediate-Configure=false -f apt \
    && apt-get -f install \
    && dpkg --configure -a \
    && apt-get -y dist-upgrade \
    && echo n | ./build/install-build-deps.sh \
    && rm -rf /build

# ninja
RUN ln -nsf /opt/depot-tools/ninja-linux64 /usr/bin/ninja

# dependencies
RUN pip3 install --upgrade pip
RUN pip3 install setuptools docopt python-slugify tqdm

# libjpeg
RUN mkdir /opt/libjpeg
WORKDIR /opt/libjpeg
RUN curl https://ijg.org/files/jpegsrc.v9c.tar.gz -o jpegsrc.v9c.tar.gz
RUN tar -xvf jpegsrc.v9c.tar.gz
WORKDIR /opt/libjpeg/jpeg-9c
RUN ./configure --prefix=/usr
RUN make && make install

# emsdk
RUN mkdir /emsdk
WORKDIR /emsdk
RUN git clone https://github.com/emscripten-core/emsdk.git . 
RUN ./emsdk install 2.0.20
RUN ./emsdk activate 2.0.20
ENV PATH="${PATH}:/emsdk:/emsdk/upstream/emscripten"

# cache system libraries
RUN bash -c 'echo "int main() { return 0; }" > /tmp/main.cc'
RUN bash -c 'source /emsdk/emsdk_env.sh && em++ -s USE_ZLIB=1 -s USE_LIBJPEG=1 -s USE_PTHREADS=1 -s ASSERTIONS=1 -o /tmp/main.html /tmp/main.cc'

# nodejs and npm
RUN curl -sL https://deb.nodesource.com/setup_14.x | sudo -E bash -
RUN apt-get install -y nodejs
RUN npm install -g npm@latest

# working dir
WORKDIR /app

Did you do something different?

inzanez commented 3 years ago

Let me put together a Dockerfile over the weekend. I will make a PR.

paulocoutinhox commented 3 years ago

Hi,

That would be nice.

I made a PR with what I have done so far: https://github.com/paulo-coutinho/pdfium-lib/pull/30/files

I tried libjpeg manually and libjpeg-turbo manually, but nothing changed. Same error:

COMMAND:

docker run -v ${PWD}:/app -it pdfium-wasm python3 make.py run generate-wasm

ERROR:

> Compiling with emscripten...
em++: error: '/emsdk/upstream/bin/wasm-emscripten-finalize --minimize-wasm-changes -g --dyncalls-i64 --dwarf /app/build/linux/x64/gen/out/pdfium.wasm -o /app/build/linux/x64/gen/out/pdfium.wasm --detect-features' failed (-9)

And I'm using version "v9c" because emscripten's libjpeg is version 9c too, just to make it "more compatible".

Thanks.
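
A note on reading `failed (-9)`: em++ is a Python program, and Python's `subprocess` module reports a negative return code when a child process is terminated by a signal. Assuming the number in the message is that return code, -9 would mean `wasm-emscripten-finalize` was killed by signal 9 (SIGKILL) rather than exiting with an error of its own. Inside a container, one common cause is the process running out of memory, but the thread does not confirm that this was the cause here. `describe_exit` below is a hypothetical helper that decodes such a code.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess return code (hypothetical helper)."""
    if returncode < 0:
        # Negative values mean "terminated by signal -returncode" on POSIX.
        number = -returncode
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {number} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"
```

On Linux, `describe_exit(-9)` names SIGKILL, which points at something outside the tool (the kernel, the container runtime, or a user) ending the process.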

paulocoutinhox commented 3 years ago

Verbose log (updated without rsp):

em++ -v -g -o /app/build/linux/x64/gen/out/pdfium.html -s EXPORTED_FUNCTIONS="$(node function-names ../xml/index.xml)" -s EXPORTED_RUNTIME_METHODS='["ccall", "cwrap"]' custom.cpp /app/build/linux/x64/debug/lib/libpdfium.a -I/app/build/linux/x64/debug/include -s DEMANGLE_SUPPORT=1 -s USE_ZLIB=1 -s USE_LIBJPEG=1 -s WASM=1 -s ASSERTIONS=1 -s ALLOW_MEMORY_GROWTH=1 -std=c++14 -Wall --no-entry
 "/emsdk/upstream/bin/clang++" -target wasm32-unknown-emscripten -DEMSCRIPTEN -fignore-exceptions -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr -D__EMSCRIPTEN_major__=2 -D__EMSCRIPTEN_minor__=0 -D__EMSCRIPTEN_tiny__=20 -D_LIBCPP_ABI_VERSION=2 -Dunix -D__unix -D__unix__ -Werror=implicit-function-declaration -Xclang -iwithsysroot/include/SDL --sysroot=/emsdk/upstream/emscripten/cache/sysroot -Xclang -iwithsysroot/include/compat -v -g -I/app/build/linux/x64/debug/include -std=c++14 -Wall custom.cpp -c -o /tmp/emscripten_temp_t5gzousm/custom_0.o
clang version 13.0.0 (/b/s/w/ir/cache/git/chromium.googlesource.com-external-github.com-llvm-llvm--project 642df18f1437b1fffea2343fa471aebfff128c6e)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /emsdk/upstream/bin
 (in-process)
 "/emsdk/upstream/bin/clang-13" -cc1 -triple wasm32-unknown-emscripten -emit-obj -mrelax-all --mrelax-relocations -disable-free -main-file-name custom.cpp -mrelocation-model static -mframe-pointer=none -fno-rounding-math -mconstructor-aliases -target-cpu generic -fvisibility hidden -debug-info-kind=limited -dwarf-version=4 -debugger-tuning=gdb -v -fcoverage-compilation-dir=/app/build/linux/x64/gen/utils -resource-dir /emsdk/upstream/lib/clang/13.0.0 -D EMSCRIPTEN -D __EMSCRIPTEN_major__=2 -D __EMSCRIPTEN_minor__=0 -D __EMSCRIPTEN_tiny__=20 -D _LIBCPP_ABI_VERSION=2 -D unix -D __unix -D __unix__ -I /app/build/linux/x64/debug/include -isysroot /emsdk/upstream/emscripten/cache/sysroot -internal-isystem /emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten/c++/v1 -internal-isystem /emsdk/upstream/emscripten/cache/sysroot/include/c++/v1 -internal-isystem /emsdk/upstream/lib/clang/13.0.0/include -internal-isystem /emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten -internal-isystem /emsdk/upstream/emscripten/cache/sysroot/include -Werror=implicit-function-declaration -Wall -std=c++11 -fdeprecated-macro -fdebug-compilation-dir=/app/build/linux/x64/gen/utils -ferror-limit 19 -fgnuc-version=4.2.1 -fcxx-exceptions -fignore-exceptions -fexceptions -fcolor-diagnostics -iwithsysroot/include/SDL -iwithsysroot/include/compat -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr -o /tmp/emscripten_temp_t5gzousm/custom_0.o -x c++ custom.cpp
clang -cc1 version 13.0.0 based upon LLVM 13.0.0git default target x86_64-unknown-linux-gnu
ignoring nonexistent directory "/emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten/c++/v1"
ignoring nonexistent directory "/emsdk/upstream/emscripten/cache/sysroot/include/wasm32-emscripten"
#include "..." search starts here:
#include <...> search starts here:
 /app/build/linux/x64/debug/include
 /emsdk/upstream/emscripten/cache/sysroot/include/SDL
 /emsdk/upstream/emscripten/cache/sysroot/include/compat
 /emsdk/upstream/emscripten/cache/sysroot/include/c++/v1
 /emsdk/upstream/lib/clang/13.0.0/include
 /emsdk/upstream/emscripten/cache/sysroot/include
End of search list.
 "/emsdk/upstream/bin/wasm-ld" @/tmp/emscripten_4c6uswj1.rsp
 "/emsdk/upstream/bin/wasm-emscripten-finalize" --minimize-wasm-changes -g --dyncalls-i64 --dwarf /app/build/linux/x64/gen/out/pdfium.wasm -o /app/build/linux/x64/gen/out/pdfium.wasm --detect-features
em++: error: '/emsdk/upstream/bin/wasm-emscripten-finalize --minimize-wasm-changes -g --dyncalls-i64 --dwarf /app/build/linux/x64/gen/out/pdfium.wasm -o /app/build/linux/x64/gen/out/pdfium.wasm --detect-features' failed (-9)

paulocoutinhox commented 3 years ago

I removed the RSP file and passed everything directly as parameters; the file is not necessary.

paulocoutinhox commented 3 years ago

The strangest part is that the test command compiles and works:

docker run -v ${PWD}:/app -it pdfium-wasm python3 make.py run test-wasm
python -m http.server --directory sample-wasm/build

paulocoutinhox commented 3 years ago

Hi,

Finally made it work!!!!!

https://pdfviewer.github.io/

The changes are on master and published as release 4505.

Can you check the debug version there to see if it throws any errors?

inzanez commented 3 years ago

Great!!! I was just starting to work on a Dockerfile. So I can skip that now. So it really was libjpeg vs. turbo, right?

And no, there's no error anymore in the build above!

paulocoutinhox commented 3 years ago

Yes. I installed the same libjpeg version that emsdk uses and modified wasm.py to copy only the required files.

paulocoutinhox commented 3 years ago

Hi @cetinsert,

It is fixed now. Can you test and check your PDFs?

Thanks.

CetinSert commented 3 years ago

@paulo-coutinho - testing now!!

CetinSert commented 3 years ago

Awesome! image

CetinSert commented 3 years ago

It just works! image

Thank you everyone!

paulocoutinhox commented 3 years ago

Very nice. Closed, finally.

People, please consider donating to help the project.

Thanks, guys, for all the help.

CetinSert commented 3 years ago

@paulo-coutinho https://Ko-fi.com/home/coffeeshop?txid=af4503c2-7d71-4181-8333-e938642f5e33&mode=public&img=ogiboughtsomeone

paulocoutinhox commented 3 years ago

@cetinsert thank you very much, man!

KameshRajendran commented 1 year ago

I am not able to open the PDF files below in https://pdfviewer.github.io/

File 1 : [WCP1 - 12-05-21 Permit Set - Architecture - T1.pdf]()

File 2 : [A2B1.pdf]()

paulocoutinhox commented 3 months ago

I tested all the PDFs with my latest version and they are OK; you can test them in my web app to check.

The only PDF I can't download from GitHub is the one from @KameshRajendran. Maybe it is a GitHub bug, because the download doesn't start.

The others I tested are OK.

If everything is ok, can you close the issue please?

Thanks.