BatUTF8 needs needs API and/or documentation cleanup

mdekstrand commented 14 years ago

The BatUTF8 API seems to be inconsistent and/or incompletely documented, particularly with regards to indexes. The Camomile UTF8 module uses the following types:

int, to refer to a character offset from the beginning of the string
index, a type abbreviation for int, to refer to a byte offset from the beginning of the string

While int and index values can be used interchangeably, the distinction is sufficient to make the documentation clear.

Batteries introduces the Byte module and its b_idx type, an abstract type. Most of the BatUTF8 functions, except look and a couple others, still use index rather than b_idx.

Is there a need for distinct index and Byte.b_idx types? If not, it seems that they should be unified and used consistently, with whichever one remains being a private type abbreviation for int. If yes, then the functionality should be made consistent and complete between them (why can I next and prev index values but not look at them?), with necessary exceptions being clearly explained.

Regardless of whether 2 or 3 indexing types are needed, I think that the documentation should be clearer on what the indexing types are and how they relate. I am willing to work on revising the documentation, but need some clarity on what the API intentions are to make sure that I do it correctly.

Further, I think the final call on what indexing types should look like should be guided by the following two principles:

Clarity in documentation and safety in client code
As much compatibility with Camomile.UTF8 and Extlib's UTF8 as possible subject to goal (1).

I am guessing that these goals are already considered in BatUTF8's design, but thought that documenting them explicitly here would be helpful in clarifying exactly how the module is supposed to work.

thelema commented 13 years ago

The original intent of adding Byte.b_idx was for my safety in working with camomile code. I'd like to remove index if possible. If not possible, then there's no reason to expose b_idx, and we can simply use it internally.

thelema commented 13 years ago

Batteries' use of UTF is cleaned up in v2 with moving to ulib for unicode.

yoriyuki commented 13 years ago

Actually I am planning to hide UTF-8 module from ulib's public interface. This is because of the obvious safety concern, that in particular, the current implementation of UTF-8 in ulib is performance oriented and uses a lot of unsafe operations.

But I'm open to discuss. What is your opinion?

mdekstrand commented 13 years ago

yorlyuki: if the UTF-8 module is hidden, would there still be some facility for UTF-8 manipulations of OCaml strings, or would all Unicode access have to go through other structures such as ropes?

I think that having UTF-8 and Unicode operations available on OCaml strings (possibly protected via a private type alias) is important, as it allows existing code that operates on UTF-8 or unknown-encoding strings to still be used. The Xmlm library, which I use extensively, uses OCaml strings to store Unicode in UTF-8 as its primary representation; pcre can also operate on UTF-8 data. Dropping the ability to use strings as unicode data would also lose UTF-8's major benefit of allowing Unicode data to be passed through 8-bit-clean string functions without problem or having to change representations.

Making the exposed UTF-8 module fairly limited is probably fine. Also, unsafe implementation details don't bother me too much so long as the interface is designed with types and, where necessary, runtime checks to ensure that client code is safe (or is explicit where it is unsafe).

yoriyuki commented 13 years ago

I'm not sure whether there is a need to handling raw byte strings and (simultaneously) treat them as strings of Unicode. Making UTF-8 module exposed is only useful for such cases. Otherwise, fast conversion from/to UTF-8 byte string and Unicode string such as ropes would be enough unless you are working very large text. But to handling large text, you would need a special data structure (or rope) anyway, so I am not sure about usefulness of UTF-8 module there, too.

Another reason to hide UTF-8 module is that supporting multiple Unicode strings is bit tedious. Camomile uses functors to achieve this, but my impression is that this design is not very popular. Most users use UTF-8 anyway, or just do not understand functors and stop using Camomile.

Anyway, ulib is only an experimental library at this stage, so I do not think that it is a good idea to rely on. But I am willing to cooperate Batteries folks to better API design and so on.

ocaml-batteries-team / batteries-included

BatUTF8 needs needs API and/or documentation cleanup #66