johnwhitington / camlpdf

OCaml library for reading, writing and modifying PDF files
Other
200 stars 29 forks source link

Use decompress instead miniz and avoid linking with C code #26

Closed dinosaure closed 3 years ago

dinosaure commented 6 years ago

cc @Drup

johnwhitington commented 6 years ago

I patched trunk with this and got:

File "pdfflate.ml", line 8, characters 40-47:
Error: The function applied to this argument has type
         ?wbits:int ->
         proof:'weak5 Decompress.B.t ->
         ('weak6, 'weak5) Decompress.Zlib_deflate.t
This argument cannot be applied with label ~witness

This is with 'decompress' from OPAM.

dinosaure commented 6 years ago

Ah yes, I use the last version of decompress which will be available soon. I will ping this PR when it will be available on OPAM.

XVilka commented 5 years ago

@dinosaure I think it is landed in opam-repository already.

XVilka commented 5 years ago

@johnwhitington Can you please reevaluate this PR? Should be fine now.

johnwhitington commented 5 years ago

Can you give an indication of your code's speed when compared to miniz (or zlib?). If it's much slower, I'm unlikely to merge this.

(HIstorical note: all of camlpdf's encryption used to be written in OCaml, many years ago, but I could never get it even close to C, so I substituted the C code you see there now)

I can run tests on my regression suite to make sure your code can cope with all the streams in my example PDFs anyway. Perhaps that might help you by finding any remaining bugs in your code.

Slight off-topic, but from your knowledge of zip, do you have any thoughts on these two comments:

https://github.com/johnwhitington/camlpdf/issues/19#issuecomment-509206682

https://github.com/johnwhitington/camlpdf/issues/19#issuecomment-511936038

?

dinosaure commented 5 years ago

< decompress.1.0.0 is ten times slower than zlib or miniz and decompress.1.0.0 is three times slower than zlib or miniz (I talked about only decompression). You should be interested by this article about how we optimized decompress.

However, it's difficult (impossible?) to do something faster than C.

In other side, decompress take the opportunity of runtime/the bound checking and avoid a segmentation fault (at worst, an exception appear).

XVilka commented 4 years ago

This PR should be updated because of the API changes of decompress library. Though from my experience it's way faster now.

dinosaure commented 4 years ago

Updated with the last version of decompress.1.1.0.

johnwhitington commented 4 years ago

I will have a look at benchmarking this with cpdf's test suite for speed and correctness in the next couple of weeks: let's get some numbers...

johnwhitington commented 4 years ago

I have successfully built camlpdf from your sources. However, when I try to build cpdf to run my tests, I get:

ocamlfind ocamlopt \
                -package camlpdf,decompress.zl -linkpkg \
                 cpdftime.o         -g        -o cpdf \
                tjutil.cmx tjutf16.cmx tjllist.cmx tjparserMonad.cmx tjjson.cmx xmlm.cmx cpdfwriteJSON.cmx cpdfstrftime.cmx cpdfcoord.cmx cpdf.cmx cpdfcommand.cmx cpdfcommandrun.cmx
File "_none_", line 1:
Error: No implementations provided for the following modules:
         Zl referenced from /Users/john/.opam/default/lib/camlpdf/camlpdf.cmxa(Pdfflate)
         De referenced from /Users/john/.opam/default/lib/camlpdf/camlpdf.cmxa(Pdfflate)
make[1]: *** [cpdf] Error 2
make: *** [native-code] Error 2

The only makefile change I made was to add decompress.zl to the Makefile:

# Build the cpdf command line tools and top level
MODS = tjutil tjutf16 tjllist tjparserMonad tjjson \
       xmlm \
       cpdfwriteJSON cpdfstrftime cpdfcoord cpdf cpdfcommand

SOURCES = cpdftime.c $(foreach x,$(MODS),$(x).ml $(x).mli) cpdfcommandrun.ml

RESULT = cpdf
ANNOTATE = true
PACKS = camlpdf decompress.zl

OCAMLFLAGS = -bin-annot
OCAMLNCFLAGS = -g -safe-string -w -3
OCAMLBCFLAGS = -g -safe-string -w -3
OCAMLLDFLAGS = -g

all : native-code native-code-library byte-code-library top htdoc

clean ::
    rm -rf doc foo foo2 out.pdf out2.pdf *.cmt *.cmti

LIBINSTALL_FILES = cpdf.a cpdf.cma cpdf.cmxa \
$(foreach x,$(MODS),$x.mli) $(foreach x,$(MODS),$x.cmi) \
$(foreach x,$(MODS),$x.cmx) $(foreach x,$(MODS),$x.cmti)

install : libinstall

-include OCamlMakefile
dinosaure commented 4 years ago

I'm not familiar with OMakefile but I think the error is about the META file on the source where

requires: "bigarray decompress"

should be replaced by:

requires: "bigarray decompress.zl"

However, if you have a repository where I can reproduce your error, it will help me to understand the issue. I updated the PR according this update.

johnwhitington commented 4 years ago

When I do that, camlpdf won't compile:

ocamlfind ocamlc -package bigarray,decompress -c -bin-annot -g -safe-string -w -3 pdfflate.ml
File "pdfflate.ml", line 14, characters 10-24:
14 |   let w = De.make_window ~bits:15 in
               ^^^^^^^^^^^^^^
Error: Unbound module De

Cpdf seems to build fine until the link step.

Here is the repository for cpdf:

https://github.com/johnwhitington/cpdf-source

There's one line referring to Pdfcode.flate_level in cpdf.ml which you should comment out.

dinosaure commented 4 years ago

Currently, with my last patch (1bf0880) camlpdf compiles. Then, I use opam to pin to this specific version opam pin add camlpdf . and with a minor changes of cpdf-source (https://paste.isomorphis.me/R8N), all compile.

I think you mis-understoud my command where only the META file provided by camlpdf must be change (to require decompres.zl instead decompress).

johnwhitington commented 4 years ago

Thanks. CPDF test suite runs in 94 minutes 30 seconds with OCaml decompress, compared with 43 minutes 28 seconds with camlzip/miniz.c.

There are some differences between which malformed or difficult streams your code and camlzip can cope with. I will analyse that soon, and update this ticket. It may be that your code is useful to CamlPDF as a fallback mechanism if Camlzip cannot decode a stream.

dinosaure commented 4 years ago

Thanks. CPDF test suite runs in 94 minutes 30 seconds with OCaml decompress, compared with 43 minutes 28 seconds with camlzip/miniz.c.

It's better than:

< decompress.1.0.0 is ten times slower than zlib or miniz and decompress.1.0.0 is three times slower than zlib or miniz

Thanks for your feedbacks :+1: !

johnwhitington commented 4 years ago

Here are some failures with decompress -- things which work with camlzip / miniz:

================================================================================
Testing file PDFTests/test.pdf
================================================================================

cpdf -control control.txt -recrypt "PDFTests/test.pdf"  -o "PDFResults/control/test.pdf"
cpdf -control control.txt -recrypt "PDFTests/test.pdf"  -o "PDFResults/control/test.pdf"
inline image reading failed: End_of_file
Failed to understand content on page
cpdf encountered an error. Technical details follow:

Malformed page content

================================================================================
Testing file PDFTests/badnameddestination.pdf
================================================================================

cpdf -scale-to-fit a4portrait -scale-to-fit-scale 0.9 -recrypt "PDFTests/badnameddestination.pdf"  -o "PDFResults/scaletofit/badnameddestination.pdf"
cpdf -scale-to-fit a4portrait -scale-to-fit-scale 0.9 -recrypt "PDFTests/badnameddestination.pdf"  -o "PDFResults/scaletofit/badnameddestination.pdf"
cpdf encountered an unexpected error. Technical Details follow:
De.Queue.Full

================================================================================
Testing file PDFTests/main128fail.pdf
================================================================================

cpdf -list-bookmarks -utf8 -recrypt "PDFTests/main128fail.pdf"  -o "PDFResults/listbookmarks/main128fail.pdf"
cpdf -list-bookmarks -utf8 -recrypt "PDFTests/main128fail.pdf"  -o "PDFResults/listbookmarks/main128fail.pdf"
cpdf encountered an unexpected error. Technical Details follow:
Invalid_argument("bits MUST be between 8 and 15 (17)")

The first one is more worrying, since it means that decompression appears to succeed, but the result is wrong.

The second and third things look like they should be returned as proper decompress errors...

dinosaure commented 4 years ago

Thanks, I don't have time yet to look into it but it's very helpful for me to fix correctly some details about decompress :+1:.