Open mdekstrand opened 14 years ago
The original intent of adding Byte.b_idx
was for my safety in working with camomile code. I'd like to remove index
if possible. If not possible, then there's no reason to expose b_idx
, and we can simply use it internally.
Batteries' use of UTF is cleaned up in v2
with moving to ulib for unicode.
Actually I am planning to hide UTF-8 module from ulib's public interface. This is because of the obvious safety concern, that in particular, the current implementation of UTF-8 in ulib is performance oriented and uses a lot of unsafe operations.
But I'm open to discuss. What is your opinion?
yorlyuki: if the UTF-8 module is hidden, would there still be some facility for UTF-8 manipulations of OCaml strings, or would all Unicode access have to go through other structures such as ropes?
I think that having UTF-8 and Unicode operations available on OCaml strings (possibly protected via a private type alias) is important, as it allows existing code that operates on UTF-8 or unknown-encoding strings to still be used. The Xmlm library, which I use extensively, uses OCaml strings to store Unicode in UTF-8 as its primary representation; pcre can also operate on UTF-8 data. Dropping the ability to use strings as unicode data would also lose UTF-8's major benefit of allowing Unicode data to be passed through 8-bit-clean string functions without problem or having to change representations.
Making the exposed UTF-8 module fairly limited is probably fine. Also, unsafe implementation details don't bother me too much so long as the interface is designed with types and, where necessary, runtime checks to ensure that client code is safe (or is explicit where it is unsafe).
I'm not sure whether there is a need to handling raw byte strings and (simultaneously) treat them as strings of Unicode. Making UTF-8 module exposed is only useful for such cases. Otherwise, fast conversion from/to UTF-8 byte string and Unicode string such as ropes would be enough unless you are working very large text. But to handling large text, you would need a special data structure (or rope) anyway, so I am not sure about usefulness of UTF-8 module there, too.
Another reason to hide UTF-8 module is that supporting multiple Unicode strings is bit tedious. Camomile uses functors to achieve this, but my impression is that this design is not very popular. Most users use UTF-8 anyway, or just do not understand functors and stop using Camomile.
Anyway, ulib is only an experimental library at this stage, so I do not think that it is a good idea to rely on. But I am willing to cooperate Batteries folks to better API design and so on.
The
BatUTF8
API seems to be inconsistent and/or incompletely documented, particularly with regards to indexes. The CamomileUTF8
module uses the following types:int
, to refer to a character offset from the beginning of the stringindex
, a type abbreviation forint
, to refer to a byte offset from the beginning of the stringWhile
int
andindex
values can be used interchangeably, the distinction is sufficient to make the documentation clear.Batteries introduces the
Byte
module and itsb_idx
type, an abstract type. Most of theBatUTF8
functions, exceptlook
and a couple others, still useindex
rather thanb_idx
.Is there a need for distinct
index
andByte.b_idx
types? If not, it seems that they should be unified and used consistently, with whichever one remains being a private type abbreviation forint
. If yes, then the functionality should be made consistent and complete between them (why can Inext
andprev
index
values but notlook
at them?), with necessary exceptions being clearly explained.Regardless of whether 2 or 3 indexing types are needed, I think that the documentation should be clearer on what the indexing types are and how they relate. I am willing to work on revising the documentation, but need some clarity on what the API intentions are to make sure that I do it correctly.
Further, I think the final call on what indexing types should look like should be guided by the following two principles:
Camomile.UTF8
and Extlib'sUTF8
as possible subject to goal (1).I am guessing that these goals are already considered in
BatUTF8
's design, but thought that documenting them explicitly here would be helpful in clarifying exactly how the module is supposed to work.