mirage / ocaml-tar

Pure OCaml library to read and write tar files
ISC License
54 stars 34 forks source link

remove the cstruct dependency #137

Closed hannesm closed 9 months ago

hannesm commented 9 months ago

I started to remove Cstruct.t from the interface.

/cc @reynir and @kit-ty-kate

also cc @dinosaure since I need some help with the tar_gz code -- when I read decompress gz.mli, I think the only input/output is bigstring (i.e. char Bigarray.Array1.t) -- is this right? This means we'll need to come from bytes/string to there!?

I tried to use string whereever possible (to denote "this is immutable"), but figured that our read definition passes in a buffer that is read into --> it needs to be bytes. And for some (to me unclear) reason the write function in OCaml (Stdlib and Lwt) uses bytes. So, I'm no longer sure whether using string is worth it (certainly all the unsafe_to/of_string operations should be avoided).

Any initial thoughts? Also, feel free to push further commits on this branch :)

dinosaure commented 9 months ago

For decompress, yes the choice was initial made to use bigstring. Again, it raises the question if it's better to use a string or a bigstring regarding to performances and it's a difficult question.

About bytes and string, I'm in favor to use string for write (even if Unix.write uses bytes) as I did for miou to avoid some surprises (about shared buffers).

kit-ty-kate commented 9 months ago

@dinosaure couldn’t we add functions taking bytes/string instead of bigstring in decompress? I feel like in most cases that do not involve using data from another C binding or other kind of buffer trickery, users of the library have to do the conversion from either string or bytes anyway so the performance gained in the C code is offset by the constant conversion in the OCaml code. Furthermore every user of decompress need to do said conversion function or use another dependency (e.g. cstruct) which adds unwanted overhead

dinosaure commented 9 months ago

Unfortunately, this is a major break in the API that cannot be taken lightly. I don't want to sound too conservative but, with @hannesm and @reynir, we've already floated the idea of re-implementing camlzip in a more minimal way for the purposes of OPAM - especially in view of the decompress dependencies.

kit-ty-kate commented 9 months ago

Unfortunately, this is a major break in the API that cannot be taken lightly.

I don’t think it is. I’m not asking to replace the function, mairly to add the functions to make it more user-friendly for people who will (and most will in my opinion) use string / bytes anyway, and would have used a custom conversion function since none are provided.

dinosaure commented 9 months ago

The support of Tar_gz is done and I fixed few issues which appeared in this transition. However, I still have some issues when I try to run tests with block device. Unfortunately, I don't have enough knowledge (yet?) to fix them (@reynir?). At least, it seems that the cram test (which tests our otar tool and Tar_gz) works perfectly now.

kit-ty-kate commented 9 months ago

I've tested this PR using my PoC extractor and it looks like it's working perfectly. Performances seem to be the same with this example (if anything ~1% faster but it's in the error margin) and the code is 25% smaller

Here is the diff for anyone curious: https://github.com/kit-ty-kate/ocaml-tar-playground/commit/da2f186d5639e77aa578225a51b5e468f9fad840

kit-ty-kate commented 9 months ago

I opened a PR against your branch here finishing up the work in https://github.com/hannesm/ocaml-tar/pull/1

All the testsuite now work, there was a problem in Tar_mirage due to an issue with offsets but I managed to track it down after half a day of debugging and fix it.

kit-ty-kate commented 9 months ago

Just a tiny technicality question: do you think we can expect this PR to be available as part of an ocaml-tar 3.0.0 release or more like 2.7.0? (I would personally think 3.0.0 is more appropriate given the changes in the API)

I'm only asking because i'm currently on https://github.com/ocaml/opam/issues/5648 which relies on this PR for the release of opam 2.2.0

hannesm commented 9 months ago

3.0 it is. I'll soon (this or next week) again review the changes here, optionally push fixes, and merge.

I expect tar 3.0 to land before mid February (the outstanding item is a nice API for foliding over an archive).

hannesm commented 9 months ago

Thanks for your commits @reynir @dinosaure @kit-ty-kate. I reviewed the entire PR, reverted some tar-mirage changes (now there's only a really small difference in these modules).

Since the test suite is also passing, I'd be keen to merge this PR.

hannesm commented 9 months ago

My plan is to squash & merge -- I merged the main branch here to have OCaml-CI run with older compilers.

hannesm commented 9 months ago

initial performance measurement (anyone into scientific benchmarking, please have a go yourself if you like):

downloading index.tar.gz from opam.ocaml.org, and executing otar.exe list index.tar.gz > /dev/null

main branch: 1.016u 0.031s 0:01.04 100.0% 915+767k 0+0io 27pf+0w this branch: 0.811u 0.039s 0:00.85 98.8% 889+735k 0+0io 26pf+0w

EDIT: running otar multiple times smoothes the results: main branch uses ~0.83s and 920k memory; this branch ~0.75s and 885k memory.

hannesm commented 9 months ago

since we use String.get_int32_ne, this requires OCaml 4.13+. IMHO that is fine, but we could recover with some "external" definition compatibility with earlier OCaml releases. I don't think it is worth the effort, though. Any other opinion?

kit-ty-kate commented 9 months ago

It could be done without externals using:

(* TODO: use String.get_int32_ne when ocaml-tar requires OCaml >= 4.13 *)
Bytes.get_int32_ne (Bytes.unsafe_of_string str)

It's your choice to use this or not, but it is to note as this is going to be used in opam we have to add our own patch internally if this stays as-is and it makes parts of the opam packages not installable with OCaml < 4.13

kit-ty-kate commented 9 months ago

it makes parts of the opam packages not installable with OCaml < 4.13

in particular it would make opam-publish not installable for people using a system switch on older systems, which actually seems like an issue to me

hannesm commented 9 months ago

@kit-ty-kate here we go, compatible with 4.08+.