johnwhitington / camlpdf

OCaml library for reading, writing and modifying PDF files
Other
200 stars 28 forks source link

Switch to safe-string mode #23

Closed alainfrisch closed 6 years ago

alainfrisch commented 6 years ago

Context: https://github.com/ocaml/ocaml/pull/1859 reports camlpdf as one of the last projects on OPAM requiring the unsafe-string mode. Since this mode might go away at some point, or at least not be supported with the default configure-time mode, it is time to adapt camlpdf!

I've chosen to minimize the changes to the public API of camlpdf. In particular, the type Pdfio.bytes keeps its name, exposing a new Pdfio.caml_bytes as an alias to the built-in bytes type. I wrote a first version using a different approach (renaming Pdfio.bytes to Pdfio.pdfbytes), but I think the current version is better.

Which versions of OCaml are supposed to be supported by camlpdf? There is a package emulating "Bytes" in OPAM, but I've used e.g. String.mapi (introduced in 4.02) in this PR.

johnwhitington commented 6 years ago

Thanks for the hard work, Alain.

CamlPDF at the moment supports 3.11.2 and above, because I have customers using AIX / Sparc Solaris and so on, where compilers no longer exist.

However, we do need to move forwards at some point. At the moment, the OPAM package has a patch for the build file to add -unsafe-string on 4.06.0 and above. I can't merge it, because such a switch does not exist in 3.11.2.

So, the new baseline must be 4.02, since that is the first version with Bytes. Thus, your use of String.mapi is fine.

I'm going to play with your patch now, and see if cpdf needs altering too, and do some tests. I'll write again when I have.

johnwhitington commented 6 years ago

I think you've accidentally lost the "zero-copy" efficiency in Pdfgenlex.

The point of having the 17-item array of pre-built strings/bytes for getuntil to read into, is that lex_item can then call int_of_string without a new string being allocated. When lex_item needs to return a string for a LexName it uses String.copy so the string is no longer shared.

In your new version, you use Bytes.to_string in getuntil, thus ensuring that the string is always copied, which defeats the object...

I suspect fixing this requires unsafe casting bytes to string when calling int_of_string and float_of_string in lex_item.

johnwhitington commented 6 years ago

cpdf builds with no changes, but some tests create malformed PDF files. I don't know if this is due to your patch or a problem with cpdf. I'll look into it further at the weekend.

alainfrisch commented 6 years ago

I think you've accidentally lost the "zero-copy" efficiency in Pdfgenlex.

Indeed, the current version induces some extra allocations. But are you sure this impacts performance a lot? Allocation of small objects are super cheap, so I think it could make sense to use a simple Buffer, and avoid the two passes lengthuntil/setinit_string; each characters is fetched by an indirect call to input_byte or a Caml->C call to really_input, so removing the two passes could actually be faster than the previous version. Anyway, int_of_string/float_of_string don't have a version working on bytes, so a copy would be needed anyway (or one would need to rely on unsafe features).

(On a related note, I've observed that camlpdf represents strings in some places as list of characters, and pattern match on that (cf Pdfutil.explode), which is of course very slow. I did not check whether this is only the case only in contexts where performance does not matter too much, but if not, this could be a potential source of significant gains.)

mseri commented 6 years ago

As a side note, as I mentioned in the ocaml/opam-repository#12365 PR, I know of many instances in which mixing -safe-string and -unsafe-string from different libraries ends up producing cmi conflicts on different interfaces when compiling dependent libraries. It would be best to avoid using an explicit flag to avoid problems (so that old compilers will keep mixing strings and bytes as they were doing, and new compilers can perform eventual optimisations).

Unfortunately I have also seen hard to debug corruptions when using unsafe conversions between strings and bytes, but it should not be a problem if you only use it in isolation when you need to call int_of_string/float_of_string (after all this is how things like Unix.write_substring are implemented).

johnwhitington commented 6 years ago

The malformed PDF issue appears to be with the development version of camlpdf, not with your patch, Alain.

I will prepare a modified version of your patch this week and apply it.