Closed hannesm closed 9 months ago
For decompress, yes the choice was initial made to use bigstring
. Again, it raises the question if it's better to use a string
or a bigstring
regarding to performances and it's a difficult question.
About bytes
and string
, I'm in favor to use string
for write
(even if Unix.write
uses bytes
) as I did for miou
to avoid some surprises (about shared buffers).
@dinosaure couldn’t we add functions taking bytes
/string
instead of bigstring
in decompress? I feel like in most cases that do not involve using data from another C binding or other kind of buffer trickery, users of the library have to do the conversion from either string
or bytes
anyway so the performance gained in the C code is offset by the constant conversion in the OCaml code. Furthermore every user of decompress need to do said conversion function or use another dependency (e.g. cstruct) which adds unwanted overhead
Unfortunately, this is a major break in the API that cannot be taken lightly. I don't want to sound too conservative but, with @hannesm and @reynir, we've already floated the idea of re-implementing camlzip
in a more minimal way for the purposes of OPAM - especially in view of the decompress
dependencies.
Unfortunately, this is a major break in the API that cannot be taken lightly.
I don’t think it is. I’m not asking to replace the function, mairly to add the functions to make it more user-friendly for people who will (and most will in my opinion) use string / bytes anyway, and would have used a custom conversion function since none are provided.
The support of Tar_gz
is done and I fixed few issues which appeared in this transition. However, I still have some issues when I try to run tests with block device. Unfortunately, I don't have enough knowledge (yet?) to fix them (@reynir?). At least, it seems that the cram test (which tests our otar
tool and Tar_gz
) works perfectly now.
I've tested this PR using my PoC extractor and it looks like it's working perfectly. Performances seem to be the same with this example (if anything ~1% faster but it's in the error margin) and the code is 25% smaller
Here is the diff for anyone curious: https://github.com/kit-ty-kate/ocaml-tar-playground/commit/da2f186d5639e77aa578225a51b5e468f9fad840
I opened a PR against your branch here finishing up the work in https://github.com/hannesm/ocaml-tar/pull/1
All the testsuite now work, there was a problem in Tar_mirage
due to an issue with offsets but I managed to track it down after half a day of debugging and fix it.
Just a tiny technicality question: do you think we can expect this PR to be available as part of an ocaml-tar 3.0.0 release or more like 2.7.0? (I would personally think 3.0.0 is more appropriate given the changes in the API)
I'm only asking because i'm currently on https://github.com/ocaml/opam/issues/5648 which relies on this PR for the release of opam 2.2.0
3.0 it is. I'll soon (this or next week) again review the changes here, optionally push fixes, and merge.
I expect tar 3.0 to land before mid February (the outstanding item is a nice API for foliding over an archive).
Thanks for your commits @reynir @dinosaure @kit-ty-kate. I reviewed the entire PR, reverted some tar-mirage changes (now there's only a really small difference in these modules).
Since the test suite is also passing, I'd be keen to merge this PR.
My plan is to squash & merge -- I merged the main branch here to have OCaml-CI run with older compilers.
initial performance measurement (anyone into scientific benchmarking, please have a go yourself if you like):
downloading index.tar.gz from opam.ocaml.org, and executing otar.exe list index.tar.gz > /dev/null
main branch: 1.016u 0.031s 0:01.04 100.0% 915+767k 0+0io 27pf+0w this branch: 0.811u 0.039s 0:00.85 98.8% 889+735k 0+0io 26pf+0w
EDIT: running otar multiple times smoothes the results: main branch uses ~0.83s and 920k memory; this branch ~0.75s and 885k memory.
since we use String.get_int32_ne, this requires OCaml 4.13+. IMHO that is fine, but we could recover with some "external" definition compatibility with earlier OCaml releases. I don't think it is worth the effort, though. Any other opinion?
It could be done without externals using:
(* TODO: use String.get_int32_ne when ocaml-tar requires OCaml >= 4.13 *)
Bytes.get_int32_ne (Bytes.unsafe_of_string str)
It's your choice to use this or not, but it is to note as this is going to be used in opam we have to add our own patch internally if this stays as-is and it makes parts of the opam packages not installable with OCaml < 4.13
it makes parts of the opam packages not installable with OCaml < 4.13
in particular it would make opam-publish
not installable for people using a system switch on older systems, which actually seems like an issue to me
@kit-ty-kate here we go, compatible with 4.08+.
I started to remove Cstruct.t from the interface.
/cc @reynir and @kit-ty-kate
also cc @dinosaure since I need some help with the tar_gz code -- when I read decompress gz.mli, I think the only input/output is bigstring (i.e. char Bigarray.Array1.t) -- is this right? This means we'll need to come from bytes/string to there!?
I tried to use string whereever possible (to denote "this is immutable"), but figured that our
read
definition passes in a buffer that is read into --> it needs to be bytes. And for some (to me unclear) reason thewrite
function in OCaml (Stdlib and Lwt) usesbytes
. So, I'm no longer sure whether using string is worth it (certainly all the unsafe_to/of_string operations should be avoided).Any initial thoughts? Also, feel free to push further commits on this branch :)