ocaml-community / zed

Abstract engine for text edition in OCaml
Other
121 stars 16 forks source link

Questions about `Zed_char` #15

Open Drup opened 5 years ago

Drup commented 5 years ago

The documentation about the new Zed_char module is not really sufficient:

Given how central the notion of character is, I think a much more precise documentation is necessary.

These questions are not hypotheticals btw. I can't even review this patch because the new version of zed makes the code awful for reasons that are completely unclear to me.

kandu commented 5 years ago

Hi, Drup.

The original type definition of zed_char.t explains some of your questions:

type Zed_char.t= {
  core: UChar.t;
  combined: UChar.t list;
  width: int; (* width of this char *)
  size: int; (* 1 + List.length combined *)
}

because the structure takes up too much space, it was internally redefined as utf8 string.

  1. It can be a single UChar core with the combined field being an empty list. In this case, the core can be arbitrary code point.
  2. It can alse be a UChar core with a list of combining marks. In this case all the UChars(core and combined) should be printable characters to be logically legal. To be more precise, a grapheme.

How do I turn a single codepoint into a zchar

val unsafe_of_uChar : UChar.t -> Zed_char.t does the job.

When does of_utf8 fails ?

when it encounters a malformed zed_char sequence, i.e a sequence starting with an combining mark dangling there.

What's the different with Zed_utf8.t ?

Internally, they share the same type definition: string which itself is of type sequence of char. We deem a zed_utf8.t to be a bundle of chars, and for zed_char.t, we deem it to be a bundle of zed_utf8.t. that's the difference.

Why is the API not very similar to Uchar ?

a Zed_char.t is a bundle of UChar.t, and It's more linguistic aware(printable/unprintable/combining marks/sequence legality)

Given how central the notion of character is, I think a much more precise documentation is necessary.

Zed_char can be either an arbitrary code point(like UChar.t) or a legal grapheme.

kandu commented 5 years ago

Why is the function that does this unsafe ?

Because it's not aware of the linguistic meaning. The unsafe_ functions doesn't check if it's well formed as a legal grapheme. But we still need these functions, for example, to deal with control characters, to combine a normal char with a sequence of combining marks(which is malformed in _safe functions)

Drup commented 5 years ago

Well, my bug report is really an invitation for you to document the mli (and, it seems, the .ml), rather than answer here. From the look of it, the data structure should definitely be documented in the code. ;)

We deem a zed_utf8.t to be a bundle of chars, and for zed_char.t, we deem it to be a bundle of zed_utf8.t

Does that mean Zed_utf8 is the utf8 representation of an Uchat.t ? If that is the case, is there a reason not to turn Zed_char into Zed_grapheme, and say it must always be a grapheme ?

On one hand, you say that Zed_char.t can be either a unique codepoint, or a grapheme, but on the other hand, the function that makes Zed_char.t from a unique codepoint is called "unsafe". This is contradictory.

In any case, for ease of use of the API, I feel it's essential to have functions that are safe and of type Uchar.t -> Zed_char.t/char -> Zed_char.t, and raise invalid_argument if the input is not a grapheme.


Don't take this the wrong way, I'm only probing you with questions so that we can converge towards a naming/API that people (with reasonable knowledge of unicode) will be able to approach and understand. I slightly regret missing out on the various PRs before the release, but it's not too late to improve things! :)

kandu commented 5 years ago

To answer here first is good, people can give suggestions, ask further questions immediately. After these questions are satisfied, I'll add them to the mli interface.

Yes, Zed_utf8 is the utf8 representation of an UChat.t. "Zed_grapheme", a good name. But names like 'zed_char', 'zed_string' follow the name convention(char, string) and imply that they are essential types while working with the zed library.

I feel it's essential to have functions that are safe and of type Uchar.t -> Zed_char.t/char -> Zed_char.t, and raise invalid_argument if the input is not a grapheme.

Ah, you are right. Someone should have given such a suggention early.

Drup commented 5 years ago

I understand where the name come from. I think it's ok to keep them, as long as the documentation is very explicit on what is what. My proposal would be to enforce (and document) that Zed_char.t is always a grapheme and that Zed_string is a string of graphemes.

Do we really need all those unsafe_* operations ? Could we avoid them through higher level combinators ?

Is it possible to make Zed_utf8 an internal module only, and only go through Uchar.ts when considering individual codepoints?

pmetzger commented 5 years ago

@kandu Please do document the .mli file with a detailed description like the one you've presented here.

kandu commented 5 years ago

My proposal would be to enforce (and document) that Zed_char.t is always a grapheme and that Zed_string is a string of graphemes

It would be like that we extract and purify the ASCII code, define a new type, called english_grapheme and when we deal with real world writing systems/user input, we then define a higher level type upon them, the result would be type char = | Control of code | Printable of english_grapheme | Null, which in my view is overdesigned and will make our code more complicated.

There are two more considerations, one is backward-compatibility, in case a project depends on zed, the mirgrating progress from zed 1 to zed 2 should be as simple as possible. another consideration is the real world, when I type these sentences, not only do I type printable chars, but also \n. Is it a good idea to purify out \n in Zed_char?

So my proposal would be that we release a minor version with two new functions in Zed_char, of_uchar and of_char and create a Q&A wiki page to record those instructive questions above. I'll document some basic info into zed_char.mli and add the link to it in case the Q&A will keep updating.

Drup commented 5 years ago

Q&A are useless, they are not integrated in the API documentation, and thus are hard to find. Nobody reads them and they end up outofdate extremely quickly.

If you want to document things, document them in the .mli

kandu commented 5 years ago

Sorry for the late reply. I've been busy with some meetings these days and didn't read your questions very carefully nor answered it with proper attention. Some of the replies above are non-sense. Sorry for that.

The current API are sufficient, though the naming may not be so deliberate. of_char and of_uChar are not necessary. To add them is a design mistake, in fact.

And to update the document is unnecessary either. If a guy has reasonable knowledge about combining marks, combined glyph or variation selectors, either of them, he'll understand the API entirely. Even if he don't know any of them, just not doing things the wrong way:

I've looked up the PR you mentioned

 let zchar_of_uchar ch =
    Zed_char.of_uChars  [ch;]
    |> function
       | (Some t, _) -> t
       | _ ->  raise @@ Failure (
           "Fail to get a Zed_char.t from "
           ^ (CamomileLibrary.UChar.int_of ch |> string_of_int))

In case we are dealing with user input, John wants to input a with a hat(a combining mark). He first inputs 'a', we do Zed_char.unsafe_of_uChar 'a' and send it to the edit engine. He then inputs '^', we do Zed_char.unsafe_of_uChar '^' and send it to the edit engine. That just works.

In almost all cases, you should not determine if it's a legal grapheme with Zed_char. Because input sequence are always uchar-individual therefore all the combining marks user input are temporarily not a legal grapheme. And if a user input ten uchar, he may end up getting only one grapheme, one core appending with nine combining marks.

Zed_string Zed_edit Zedrope Zed... are already well designed to deal with such a situation. For example, if you Zed_string.append or Zed_rope.append two str: str1, str2. The Zed_string.length or Zed_rope.length result_str may not be equal to length str1 + length str2. Because the new zed automatically done the duty work --- combining mark joining.

The only missing document we should have written in zed_char.mli is:

the prefix 'unsafe' means it doesn't check if it's a valid grapheme, and in most cases, that's the right way to work with Zed_char.

And in case you really want to do that, Zed_char.( prop / prop_uChar / is_printable / is_printable_core in_combining_mark will help).

Octachron commented 5 years ago

@kandu, you already took the time of answering the very valid questions that @Drup asked. You should take the time to write down your answer in a structured way in the .mli file so that anyone that have the same questions can find them directly. In particular, it would be nice if the original and expanded form of the Zed_char.t type would be used as an introduction of the intend behind Zed_char.t

You cannot assume that people will read your mind and understand perfectly the API. It is true that the API is quite straightforward, but there is some wiggling room in the unsafe functions that should be much better documented:

Drup commented 5 years ago

The simple fact that we have this long discussion is proof that we really need some more documentation. :)

Let me be clear about one fact: common use of an API should never have to use unsafe functions. If that's the case, your API is badly designed, and it needs to change. The code pattern in ocp-browser are all extremely standard and not particularly unsafe/surprising/complex, they should not require the use of unsafe functions.

pmetzger commented 5 years ago

@kandu Please do add the documentation to the .mli file. Most of what you need you've already written in this discussion.

kandu commented 5 years ago

I'm back home, squeezing time for it.

kandu commented 5 years ago

How do they fail if the precondition is not respected?

I'm starting to understand why you are concerning about the new features/apis. OCaml community used to live in a simple and elegant world, there is no IPA, no CJK, no glyph variant. And for a long time, we didn't expect that these features would be supported.

So we are now concerning about the precondition, failure, unsafety.

A picture will explain better: image

The literal string in function append_cm is malformed. It's '^' 'b'. The leading char is a dangling dissociated combining mark. What we are seeing is the complicated real world, we even have to deal with malformed strings and take them as legal literal values.

In other words, even these malformed literal values are well concerned and supported, so what we do is

  1. accept the complexity, use the unsafe_ function when dealing with user input.
  2. forget it, let the new zed/lambda-term worry about all the complexity.

And that's one of the reasons why I said

And to update the document is unnecessary either.

p.s. the tolerance are only allowed in user input phase.

But I were wrong. People in the OCaml community are always clever and reasonable guys. They want to infer, analyze, understand things both outwardly and inwardly.

I drafted #18, try to clarify some of the doc comment. But not all the questions above are answered in the PR. Some of the questions above are come from the unfamiliarity. I think to clarify the basic principle, to serve it as a clue for further detailed info is more useful.

And here are some context info about the new zed/lambda-term, https://github.com/ocaml-community/lambda-term/issues/2

kandu commented 5 years ago

Also to mention, the reason why Zed_re module was divided into two Zed_re.Core Zed_re.Raw modules is for regexp matching on glyph variant, which is a common issue while investigate, research ancient CJK scripts. I was going to improve the functionality of Lterm_edit and Zed_edit so that it would be a competent widget to help us study/research into ancient scripts.

It seems that we're going to eliminate the regexp facility in #16. It's fine. Anyway, If necessary, we can implement such a regexp engine in reasonable time.