tarides / hackocaml

OCaml hacking ideas, small and large.
MIT License

Super efficient way of embedding files or binary data in OCaml executables #21

Open kayceesrk opened 4 months ago

kayceesrk commented 4 months ago
hannesm commented 1 month ago

I somehow came across this issue, and there are some other utilities available in OCaml:

I use(d) both (actually all three approaches, so crunch as well) in different applications, and they seem to work nicely. With caravan you have to hope that the linker (or strip) isn't working against you.

And from your description of ocaml-crunch, I'd appreciate a PR that removes the chunking. The reason why there is an API is that it is meant for entire directories being embedded (not only a single file). Plus the mirage-kv API is satisfied, so it can act as a key-value store.

That said, I don't know your use case, or why "super efficient" is crucial.

kayceesrk commented 1 month ago

CC @MisterDA who suggested the task originally.

reynir commented 1 month ago

I experimented with a malfunction-based replacement of crunch. My motivation was that crunch could, IIRC, be a bit slow due to the parsing of very large string literals. I got stuck on how to use it with e.g. dune. Unfortunately, I think the code is on the drive of my now-dead laptop (possibly recoverable).

MisterDA commented 1 month ago

caravan isn't portable to macOS or Windows. ppx_blob is more-or-less conceptually equivalent to crunch.

A first step could be to expose binary data as string or bytes (think embedded assets, CSS, js, images…). Without special support, the compiler has to lex/parse a potentially long string, handling escape sequences along the way. A second step could be to expose it as an array of integers (think neural network weights, numerical data…). This is more interesting, because usually each integer is parsed by the compiler and generates a node in the AST with location information, which easily becomes extremely costly.
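To make the cost concrete, here is a minimal sketch of what a source-level embedding tool has to emit today. The names `string_literal_of_bytes` and `int_array_literal` are hypothetical and not taken from crunch or ppx_blob; the point is only that every byte becomes an escape sequence the lexer must decode, and every integer becomes its own token (and AST node) in the generated source.

```ocaml
(* Hypothetical generator functions, for illustration only. *)

(* Render raw bytes as an OCaml string literal, escaping every byte as \xHH.
   The compiler later has to lex each of these escapes again. *)
let string_literal_of_bytes (data : string) : string =
  let buf = Buffer.create ((4 * String.length data) + 2) in
  Buffer.add_char buf '"';
  String.iter
    (fun c -> Buffer.add_string buf (Printf.sprintf "\\x%02x" (Char.code c)))
    data;
  Buffer.add_char buf '"';
  Buffer.contents buf

(* Render integers as an OCaml int array literal.
   Each element is parsed separately and becomes one AST node. *)
let int_array_literal (xs : int array) : string =
  let items = Array.to_list (Array.map string_of_int xs) in
  "[|" ^ String.concat "; " items ^ "|]"
```

The generated text grows to roughly four characters per embedded byte in the string case, which is the overhead a compiler-level embedding would avoid.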

Loosely related, there was https://github.com/ocaml/ocaml/pull/10654, which used incbin to add debug information to complete bytecode executables, sidestepping the C compiler, which cannot handle debug info as either strings or integer arrays.

The interface could be exposed as an extension node, taking a filename as a parameter, just like ppx_blob, but the compiler should not process the binary data as part of its AST. For its #embed directive, the C23 standard defines the parameters limit, prefix, suffix and if_empty. We'd need another parameter selecting the type as which the data should be exposed, say string, bytes, int array, …
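As a rough illustration of how those parameters could combine, the sketch below models them as an ordinary runtime function. Everything here is an assumption: the `repr` type, the `embed` function and its labels are hypothetical, the file contents are passed in as a string instead of being read at compile time, and prefix/suffix are treated as raw bytes (in C they are token sequences). It follows my reading of #embed, where prefix and suffix apply only to non-empty data and if_empty is used when the (possibly limited) data is empty.

```ocaml
(* Hypothetical target types for the proposed "type" parameter. *)
type _ repr =
  | As_string : string repr
  | As_bytes : bytes repr
  | As_int_array : int array repr

(* Runtime model of a compile-time embed; [raw] stands in for the file
   contents. A real implementation would do this inside the compiler. *)
let embed (type a) ?limit ?(prefix = "") ?(suffix = "") ?(if_empty = "")
    (repr : a repr) (raw : string) : a =
  (* limit: keep at most [n] bytes of the resource. *)
  let body =
    match limit with
    | Some n when n >= 0 && n < String.length raw -> String.sub raw 0 n
    | _ -> raw
  in
  (* prefix/suffix wrap non-empty data; if_empty replaces empty data. *)
  let data = if body = "" then if_empty else prefix ^ body ^ suffix in
  match repr with
  | As_string -> data
  | As_bytes -> Bytes.of_string data
  | As_int_array ->
      Array.init (String.length data) (fun i -> Char.code data.[i])
```

For example, `embed ~limit:3 As_string "hello"` evaluates to `"hel"`, and `embed As_int_array "AB"` to `[|65; 66|]`. Indexing the result type with a GADT is just one way to let a single entry point return string, bytes or int array; separate extension names per type would work as well.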