Open kayceesrk opened 4 months ago
I somehow came across this issue, and there are some other utilities available in OCaml:
I use(d) both (actually all three approaches, so crunch as well) in different applications, and they seem to work nicely. With caravan you have to hope that the linker (or strip) isn't working against you.
And from your description of ocaml-crunch, I'd appreciate a PR that removes the chunking. The reason why there is an API is that it is meant for entire directories being embedded (not only a single file). Plus the mirage-kv API is satisfied, so it can act as a key-value store.
That said, I don't know your use case, or why "super efficient" is crucial.
CC @MisterDA who suggested the task originally.
I experimented with a malfunction-based replacement of crunch. My motivation was that crunch could IIRC be a bit slow due to parsing of very large string literals. I got stuck with how to use it with e.g. dune. Unfortunately, I think the code is on the drive of my now-dead laptop (possibly recoverable).
`caravan` isn't portable to macOS or Windows. `ppx_blob` is more-or-less conceptually equivalent to `crunch`.
A first step could be to expose binary data as string or bytes (think embedded assets, CSS, js, images…). Without special support, the compiler has to lex/parse a potentially long string, handling escape sequences along the way. A second step could be to expose it as an array of integers (think neural network weights, numerical data…). This is more interesting, because usually each integer is parsed by the compiler and generates a node in the AST with location information, which easily becomes extremely costly.
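To make the two steps concrete, here is a minimal sketch of what can already be done by hand: keep the data as a `string` and decode it into an `int array` at runtime, so the compiler only sees one literal instead of one AST node per integer. The names `blob` and `int_array_of_blob` are made up for illustration, the 32-bit little-endian layout is an assumption, and `String.get_int32_le` needs OCaml 4.13 or later.

```ocaml
(* Stand-in for an embedded blob: two 32-bit little-endian integers,
   1 and 258. In the proposal this string would come from a file
   rather than from a literal in the source. *)
let blob = "\x01\x00\x00\x00\x02\x01\x00\x00"

(* Decode a string of packed 32-bit little-endian integers.
   Trailing bytes that do not fill a whole integer are ignored. *)
let int_array_of_blob (s : string) : int array =
  let n = String.length s / 4 in
  Array.init n (fun i -> Int32.to_int (String.get_int32_le s (i * 4)))

let weights = int_array_of_blob blob
```

This still pays for lexing the string literal and for the decoding pass at startup, which is the cost the compiler-level support would remove.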
Loosely related, there was https://github.com/ocaml/ocaml/pull/10654 using `incbin` to add debug information to complete bytecode executables, sidestepping the C compiler, which cannot handle debug info as either strings or integer arrays.
The interface could be exposed as an extension node, taking a filename as a parameter, just like `ppx_blob`, but the compiler should not process the binary data as part of its AST. The C standard defines the parameters `limit`, `prefix`, `suffix` and `if_empty` (`#embed`). We'd need another parameter selecting the type as which the data should be exposed, say `string`, `bytes`, `int array`, …
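As a rough model of how the four `#embed` parameters interact, here is a sketch over plain strings. The function `embed` is hypothetical and is not an existing or proposed API; in C the `prefix`, `suffix` and `if_empty` arguments are token sequences, so treating them as byte strings is an assumption made only to show the semantics.

```ocaml
(* Hypothetical model of the C23 #embed parameters, applied to the
   contents of a resource that has already been read into a string. *)
let embed ?limit ?(prefix = "") ?(suffix = "") ?(if_empty = "") data =
  (* limit caps the number of bytes taken from the resource. *)
  let data =
    match limit with
    | Some n when n >= 0 && n < String.length data -> String.sub data 0 n
    | _ -> data
  in
  (* Emptiness is decided after the limit is applied: an empty result
     yields if_empty, and prefix/suffix are used only otherwise. *)
  if data = "" then if_empty else prefix ^ data ^ suffix
```

For example, `embed ~limit:3 "abcdef"` gives `"abc"`, and `embed ~limit:0 ~if_empty:"none" "abcdef"` gives `"none"`.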
There is ocaml-crunch, which takes files, splits them into chunks, and generates an OCaml source module with an API. It's not super efficient: files have to be parsed at compile time, and recomposed when accessed. Room for improvement: don't generate OCaml code that includes the data to be embedded.
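The chunk-and-recompose cost described above can be illustrated with a toy module. This is a hypothetical shape written for this comment, not the code ocaml-crunch actually generates; the module names `Chunked` and `Whole` and the sample contents are made up.

```ocaml
(* Hypothetical chunked layout: the file is stored as several string
   literals and has to be concatenated on every read. *)
module Chunked = struct
  let chunks = [ "Hello, "; "embedded "; "world!" ]
  let read () = String.concat "" chunks
end

(* Same data stored as one value: a read returns it directly,
   with no allocation and no copying. *)
module Whole = struct
  let contents = "Hello, embedded world!"
  let read () = contents
end
```

Both variants still make the compiler lex the literals, so removing the chunking only addresses the access-time half of the problem.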
Idea: implement that with magic from the compiler, Dune, or cppo. Make files available as `Bytes`, in a module created at compile time.

C23 will have an `#embed` preprocessor directive (under the hood, it's the linker's job). No parsing of data, available as a static const array of bytes. Check out how Rust or Golang are doing it.