ocaml / ocaml

The core OCaml system: compilers, runtime system, base libraries
https://ocaml.org

Add UTF codecs and validators to the `Stdlib` #10660

Closed dbuenzli closed 2 years ago

dbuenzli commented 3 years ago

Since we have Uchar.t UTF encoders in Buffer, I have been thinking (for a long time) it would be nice to add UTF decoders to the Stdlib. For now I focus on simply providing an API for Bytes, the rest can follow later.

I think I finally found a design I'm happy with. It is efficient (no allocations, no exceptions) and can easily be used to devise wasteful higher-level abstractions (e.g. Uutf-like folders or Seq-based iterators). It also has good properties for best-effort decodes.
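For instance, a Seq-based iterator of the kind mentioned above can be sketched on top of the decode API (the function names, String.get_utf_8_uchar and the Uchar.utf_decode_* accessors, are those of the proposal and could still change; malformed input decodes to U+FFFD):

```ocaml
(* Sketch of a Seq-based UTF-8 iterator built on the proposed decode API.
   On malformed input the decode yields the replacement character U+FFFD
   and advances by the number of bytes the decoder consumed. *)
let utf_8_uchars (s : string) : Uchar.t Seq.t =
  let rec loop i () =
    if i >= String.length s then Seq.Nil else
    let d = String.get_utf_8_uchar s i in
    Seq.Cons (Uchar.utf_decode_uchar d, loop (i + Uchar.utf_decode_length d))
  in
  loop 0
```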

Since these things are a pain to tweak and work on when juggling a full compiler build, I developed this in this repository. Note that there is a single API candidate there, with just a few different implementations of UTF-8 decoding that I wanted to try; see the README for all the details.

Before I consider making a PR, I'd like to:

  1. Know whether upstream wants this.
  2. Gather comments and agree on the API (structure and names).
  3. Agree on the UTF-8 decoder implementation to use.

For 3. I suggest using pat: it was the fastest on my machine and it lets the compiler nicely work out the dispatch table for us by using character range patterns :-)
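To give the flavor of what character range patterns buy here, the classification of the first byte can be written like this (a sketch only, not the actual pat implementation):

```ocaml
(* First-byte classification of UTF-8 with character range patterns: the
   compiler turns this match into a dispatch table. Returns the expected
   encoding length, or 0 for a byte that cannot start a valid encoding. *)
let utf_8_decode_length_of_first_byte = function
| '\x00' .. '\x7F' -> 1   (* ASCII *)
| '\xC2' .. '\xDF' -> 2   (* C0 and C1 would be overlong encodings *)
| '\xE0' .. '\xEF' -> 3
| '\xF0' .. '\xF4' -> 4   (* F5..FF would exceed U+10FFFF *)
| _ -> 0                  (* continuation or invalid leading byte *)
```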

Also I still need to work out a little test suite.

Feel free to open issues and/or make PRs there as well if you see fit.

nojb commented 3 years ago
  1. Know whether upstream wants this.

I only had a brief glance at the API, but I can confirm: upstream wants this :) I'll try to take a fuller look tomorrow. Thanks!

nojb commented 3 years ago

In your description you speak of decoders and validators but the proposal also includes encoders (set_utf_*_uchar) which are the counterparts of the existing add_utf_* functions in Buffer. Just for my own understanding:

Finally, I couldn't come up with any obvious use cases for these set_* functions; do you have any to share?

dbuenzli commented 3 years ago

Thanks @nojb for having a look.

  • the "encoder" part is logically independent of the decoders and validators part, right?

Yes. I adapted the issue title to reflect this.

  • the encoder signature bytes -> int -> Uchar.t -> int differs from existing set_* [...] (raising if there isn't enough space)

Yes, it differs. But the other set_* functions write fixed-length encodings, so you statically know beforehand how much you are going to set; note that they raise Invalid_argument, i.e. a programming error.

Both UTF-8 and UTF-16 are variable-length encodings. We could follow the same design as the other set_* functions and require clients to call Uchar.utf_{8,16}_byte_length (also part of the proposal and logically independent) before setting their uchars, but this means that we perform exactly the same dispatch on the uchar magnitude twice in a row, which I personally find silly.
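The two-dispatch alternative would look like this on the client side (a sketch using the proposal's names, Uchar.utf_8_byte_length and Bytes.set_utf_8_uchar; the helper set_checked is hypothetical):

```ocaml
(* The uchar magnitude is examined twice: once by utf_8_byte_length and
   once again inside set_utf_8_uchar. *)
let set_checked (b : bytes) (pos : int) (u : Uchar.t) : int option =
  let n = Uchar.utf_8_byte_length u in        (* first dispatch *)
  if pos + n > Bytes.length b then None
  else Some (Bytes.set_utf_8_uchar b pos u)   (* second, identical dispatch *)
```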

Finally, I couldn't come up with any obvious use cases for these set_* functions; do you have any to share?

Encoding in fixed-size buffers to write(2). Besides, if we want to put the stdlib on a diet, we could reuse these functions to implement those of Buffer.
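Concretely, the fixed-buffer use case could look like this (a sketch assuming the variant where the encoder returns 0 when there is no room, one of the options discussed in this thread; encode_all and its 16-byte buffer are illustrative, with flush standing in for a write(2) call):

```ocaml
(* Encode a list of uchars through a small fixed-size staging buffer,
   handing full chunks to [flush] (e.g. a write(2) wrapper). *)
let encode_all ~flush (us : Uchar.t list) =
  let buf = Bytes.create 16 in
  let pos = ref 0 in
  let flush_buf () = flush (Bytes.sub buf 0 !pos); pos := 0 in
  let rec put u =
    if !pos >= Bytes.length buf then (flush_buf (); put u) else
    let n = Bytes.set_utf_8_uchar buf !pos u in
    if n = 0 then (flush_buf (); put u)   (* no room for [u]: flush, retry *)
    else pos := !pos + n
  in
  List.iter put us;
  if !pos > 0 then flush_buf ()           (* write out the remainder *)
```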

I also have opened issues https://github.com/dbuenzli/stdlib-utf/issues/1 and https://github.com/dbuenzli/stdlib-utf/issues/2 which need discussion.

nojb commented 3 years ago

Both UTF-8 and UTF-16 are variable-length encodings. We could follow the same design as the other set_* functions and require clients to call Uchar.utf_{8,16}_byte_length (also part of the proposal and logically independent) before setting their uchars, but this means that we perform exactly the same dispatch on the uchar magnitude twice in a row, which I personally find silly.

Returning the encoding length seems fine to me. What I am less sure about is whether to signal error by returning a negative value. On the one hand, it is a clever way of returning the extra information without any allocation or exception. On the other hand, the general style of the stdlib is to always signal an error either by raising or by using an option type (in this case an option type would be heavy, but I guess raising is a possibility).

dbuenzli commented 3 years ago

What I am less sure about is whether to signal error by returning a negative value.

Maybe it's a matter of viewpoint. Personally I don't see that case as being an error or particularly exceptional. It will happen, so I really don't want to raise. Note that it's the same interface Unix.single_write gives you on fds, but with a twist.

If there is strong consensus on not returning negative numbers, then I'd rather suggest returning 0 when there is no space for encoding. That way the return value of the function is always the number of bytes written.

dbuenzli commented 3 years ago

Also, maybe we can change the names: rather than get/set we could use decode/encode or read/write.

dbuenzli commented 3 years ago

If there is strong consensus on not returning negative numbers, then I'd rather suggest returning 0 when there is no space for encoding.

Also, I want to add that none of the use cases I personally have in mind would suffer from that (basically I would call write(2) when I get 0, and start encoding again at position 0 of the buffer).

And for using the functions to reimplement the existing Buffer encoders, we could simply overapproximate with resize b 4 when the encoding function returns 0.
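A sketch of what that reuse could look like (the buf record is a hypothetical stand-in for Buffer's internals, and the growth policy is illustrative: on a 0 return we overapproximate by ensuring at least 4 free bytes, then retry):

```ocaml
(* Hypothetical simplified buffer; the real Buffer.t is more involved. *)
type buf = { mutable bytes : bytes; mutable len : int }

let rec add_utf_8_uchar t u =
  let n =
    if t.len >= Bytes.length t.bytes then 0  (* no valid position left *)
    else Bytes.set_utf_8_uchar t.bytes t.len u
  in
  if n > 0 then t.len <- t.len + n
  else begin
    (* Overapproximate: grow so at least 4 bytes are free, then retry. *)
    let cap = max 4 (2 * Bytes.length t.bytes + 4) in
    let bytes = Bytes.create cap in
    Bytes.blit t.bytes 0 bytes 0 t.len;
    t.bytes <- bytes;
    add_utf_8_uchar t u
  end
```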

nojb commented 3 years ago

If there is strong consensus on not returning negative numbers, then I'd rather suggest returning 0 when there is no space for encoding.

Also, I want to add that none of the use cases I personally have in mind would suffer from that (basically I would call write(2) when I get 0, and start encoding again at position 0 of the buffer).

And for using the functions to reimplement the existing Buffer encoders, we could simply overapproximate with resize b 4 when the encoding function returns 0.

I think this is the last point that remains to be decided before moving on to PR-stage. Let's try to make a decision. Returning 0 seems fine to me. @gasche @alainfrisch do you have an opinion? The question is: given the proposed signature

val set_utf_8_uchar : t -> int -> Uchar.t -> int
  (** [set_utf_8_uchar b i u] UTF-8 encodes [u] at index [i] in [b].
      If the result value [n] is:
      {ul
      {- [n > 0], [n] is the number of bytes that were used for encoding
         [u]. A new character can be encoded at [i + n].}
      {- [n < 0], [b] was left untouched because there was no space left
         at [i] to encode [u]. [-n] is the total number of bytes needed for
         encoding [u].}} *)

we are discussing what to return when there is not enough space:

  1. n < 0 with the length encoded as explained above (initial proposal),
  2. n = 0 (so the function always returns total number of bytes written, also similar to Unix.single_write)
  3. raise an exception

I believe we have a soft consensus on 2. but having one or two extra opinions would be helpful!

gasche commented 3 years ago

I think the original design (n < 0) comes from the idea of giving all information without compromising on performance, at the cost of ease of use, with the idea that easier-to-use functions can be built on top of it. The new proposal (n = 0) makes the function simpler and easier to work with, but loses some information, making it a worse building block. (Would it make sense to provide both versions?) If we must have only one, maybe the user-friendly one is a better fit for the stdlib.

dbuenzli commented 3 years ago

The new proposal n = 0 makes the function simpler and easier to work with, but loses some information making it a worse building block.

I think we should also consider:

  1. When the better building block is actually useful. The only use case I have is if you really need to adjust the size of a buffer to squeeze in one last character. How often will you need that? And in that case, is it really horrible to call Uchar.utf_{8,16}_byte_length on the char? It would perfectly fit the Buffer case, but there resize will also be happy being called with 4.
  2. Is that a pattern that would be useful for other kinds of encoders to follow? Can we find it elsewhere? Or will only the Stdlib UTF encoders have that weird interface?

Personally I'm starting to lean towards the return-0 interface. That's quite an established interface: it's the interface of write(2).

alainfrisch commented 3 years ago

I strongly dislike the idea of encoding a 2-case result type into an integer, with different meanings for positive and negative values. I don't see concrete cases where the caller would need to know the exact number of bytes which are missing to encode the uchar, and if the need arises, we have a function that gives the answer, with some small extra cost.

The approach with returning 0 when there is not enough space at least has a uniform meaning (number of bytes actually written). It's slightly unsatisfactory having to check for a specific value to know whether the function succeeded or not, but I'm not opposed to it; it is a good compromise.

I wonder what would be the performance impact of always calling *_byte_length. It's true that it's a bit silly to do the same dispatching twice, but since we are trading this weirdness against some encoding weirdness, it's interesting to get an idea of the performance impact. Also, one could argue that knowing the encoding size in advance could lead to simpler logic on the caller side, e.g. avoiding the need to "try again" after expanding the buffer.

Otherwise, what about returning a (int, int) Result.t? There would only be a small number of values (for utf-8, Ok 1, Ok 2, Ok 3, Ok 4, Error 1, Error 2, Error 3, Error 4) and they can be pre-allocated in the library. So performance might be ok as well, and it would force the caller to explicitly pattern match on the result, which seems more idiomatic than checking for 0. A slight variant would be to use a dedicated sum type, with an unboxed Ok constructor (soon available? cc @gasche @nchataing).

dbuenzli commented 3 years ago

Thanks @alainfrisch for your input.

It's true that it's a bit silly to do the same dispatching twice, but since we are trading this weirdness against some encoding weirdness,

I personally don't find any weirdness in the set_* functions. They are also much easier to use for clients, so on that ground I'll skip the measurement you suggest :-) In fact I'm really starting to think that if I had called them encode_* people would trip up a bit less over the fact that they return a value…

I changed the proposal to "return the number of written bytes". Personally I see nothing non-idiomatic in that: again, it's just the signature of write(2). And since we seem to agree it's not that useful to know the number of bytes needed when there's not enough space, I don't think we need to unpack a result type or advanced compilation technology for that.

https://github.com/dbuenzli/stdlib-utf/commit/d40aa57ac3643a98f0555b5c0e550d118f3dbd96 Amends the proposal.

dbuenzli commented 3 years ago

I have slightly changed the decode functions so that they do not mention bytes, as the Uchar.utf_decode type can also be useful for decoding wide char strings (e.g. JavaScript strings); this was discussed here.
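For instance, a decoder over an array of UTF-16 code units could produce Uchar.utf_decode values like this (a sketch; the Uchar.utf_decode and Uchar.utf_decode_invalid constructors are the proposal's names, and utf_16_array_uchar is hypothetical):

```ocaml
(* Decode one uchar from an array of UTF-16 code units (e.g. a
   JavaScript-style string), pairing high and low surrogates. *)
let utf_16_array_uchar (a : int array) (i : int) : Uchar.utf_decode =
  let u = a.(i) in
  if u < 0xD800 || u > 0xDFFF then Uchar.utf_decode 1 (Uchar.of_int u)
  else if u > 0xDBFF || i + 1 >= Array.length a then
    Uchar.utf_decode_invalid 1  (* lone low surrogate, or high at end *)
  else
    let lo = a.(i + 1) in
    if lo < 0xDC00 || lo > 0xDFFF then Uchar.utf_decode_invalid 1 else
    let cp = 0x10000 + (((u - 0xD800) lsl 10) lor (lo - 0xDC00)) in
    Uchar.utf_decode 2 (Uchar.of_int cp)
```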

All the proposed additions can be found here; I'm glad if people manage to complain before I PR (open an issue on the tracker). I hope to be able to do that sometime next week, according to this plan, so that this hopefully gets in before we all get frozen.