utop does not start

hannesm commented 5 years ago

I have: OCaml 4.07.1, zed 2.0.1, utop 2.4.0 (anything else you'd like to know?), my .ocamlinit:

(* Added by OPAM. *)
let () =
  try Topdirs.dir_directory (Sys.getenv "OCAML_TOPLEVEL_PATH")
  with Not_found -> ()
;;

#use "topfind";;
#thread;;
#utop_prompt_dummy;;

while trying to start utop, I get the following output:

# utop
────────────────┬─────────────────────────────────────────────────────────────┬────────────────
                │ Welcome to utop version 2.4.0 (using OCaml version 4.07.1)! │                
                └─────────────────────────────────────────────────────────────┘                
Findlib has been successfully loaded. Additional directives:
  #require "package";;      to load a package
  #list;;                   to list the available packages
  #camlp4o;;                to load camlp4 (standard syntax)
  #camlp4r;;                to load camlp4 (revised syntax)
  #predicates "p,q,...";;   to set these predicates
  Topfind.reset();;         to force that packages will be reloaded
  #thread;;                 to enable threads

Fatal error: exception Zed_string.Invalid("at position 0: invalid start of Zed_char sequence", "\226\128\13918:\226\128\13921:\226\128\13917- \226\128\13914:\226\128\13941:\226\128\13905;;")

kandu commented 5 years ago

"\226\128\13918:\226\128\13921:\226\128\13917- \226\128\13914:\226\128\13941:\226\128\13905;;" decoded to UChar hex: 200B 31 38 3A 200B 32 31 3A 200B 31 37 2D 20 200B 31 34 3A 200B 34 31 3A 200B 30 35 3B 3B

that is ‛18:‛21:‛17- ‛14:‛41:‛05;;

The ‛ character is of type 'General Punctuation' and belongs to the range (0x200B, 0x200F), which used to be treated as combining marks. I'll update the combining_set in CharInfo_width to resolve this issue.
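For readers who want to reproduce the decoding above, here is a minimal sketch using only the OCaml standard library. It assumes OCaml 4.14 or later for String.get_utf_8_uchar (newer than the 4.07.1 in the report); the function name code_points is made up for this illustration and is not part of zed or utop.

```ocaml
(* Decode a UTF-8 string into its Unicode scalar values.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar.
   Invalid bytes decode to U+FFFD and are skipped one decode at a time. *)
let code_points (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      go (i + Uchar.utf_decode_length d) (Uchar.to_int u :: acc)
  in
  go 0 []

let () =
  (* The first six bytes of the string from the exception message. *)
  let s = "\226\128\13918:" in
  List.iter (Printf.printf "%X ") (code_points s);
  print_newline ()
```

The three bytes \226\128\139 are the UTF-8 encoding of U+200B, so the history entry begins with a zero-width character, which is the "position 0" the exception complains about.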

kandu commented 5 years ago

https://github.com/ocaml/opam-repository/pull/14332

kandu commented 5 years ago

hmm ‛18:‛21:‛17- ‛14:‛41:‛05;; itself is not a legal ocaml statement, and it looks odd to me.

dbuenzli commented 5 years ago

Not sure if that's related but for a few days I had the following (bogus) exception trace being thrown in my face when starting utop 2.4.0

Fatal error: exception Invalid_argument("String.sub / Bytes.sub")
Raised at file "src/core/lwt.ml", line 2998, characters 28-29
Called from file "src/unix/lwt_main.ml", line 26, characters 8-18
Called from file "src/lib/uTop_main.ml", line 1422, characters 2-32
Called from file "src/lib/uTop_main.ml", line 1470, characters 6-30
Called from file "src/lib/uTop_main.ml", line 1491, characters 4-25

just thought now of rm ~/.utop-history and utop works again. Sadly I didn't keep a copy of that file.

hannesm commented 5 years ago

@dbuenzli ah, thanks for the tip. works for me after moving .utop-history out of the way

kandu commented 5 years ago

@dbuenzli, @hannesm: see #287, some recent changes triggered a hidden bug.

kandu commented 5 years ago

https://github.com/ocaml/opam-repository/pull/14332#issuecomment-504668529

kandu commented 5 years ago

utop encountered malformed character sequence unexpectedly. This happens because you typed some malformed character sequence with old utop.

With utop >= 2.4, malformed characters are not allowed and if encountered, utop fixes them automatically. That is, if we type two combining marks dangling there, utop will append these dangling characters to the immediately subsequent normal character. And then, all the statements stored in the .utop-history are well formed zed_strings.

Auto-fix only works with utop >= 2.4 when utop is running and reading user input. So a historical .utop-history file generated with old utop may still cause trouble.

dbuenzli commented 5 years ago

utop encountered malformed character sequence unexpectedly. [...] With utop >= 2.4, malformed characters are not allowed and if encountered, utop fixes them automatically. That is, if we type two combining marks dangling there, utop will append these dangling characters to the immediately subsequent normal character

What do you call malformed sequences ? There's no such thing in Unicode, any character (Unicode scalar value to be precise) sequence is well-formed. You can only have malformed encodings of Unicode character sequences.

kandu commented 5 years ago

In general I agree with you.

What do you call malformed sequences ?

Let's have a look at some math expressions:

2 / 1 = 2: it's well-encoded and is a well-formed sequence.
2 \ 1 = 2: it's a malformed encoding.
2 / 0: in this expression, 2, /, 0 are all well-encoded, but I call it a malformed sequence.

How do we represent(display) a variation selector? It's of width 0 and is a selector. There is no glyph info associated with it, so we can just ignore it when it appears individually. As for ̂(It's combined with a space now), An individual ̂ is of width 0 and there is glyph info associated with it. A glyph, occupies some space, and its width is zero? That is ridiculous.

There's no such thing in Unicode

There are such things in real world.

dbuenzli commented 5 years ago

There are such things in real world.

Exactly and hence:

This happens because you typed some malformed character sequence with old utop.

Should not be treated as "malformed"...

kandu commented 5 years ago

We would like to hear more opinions.

kandu commented 5 years ago

and

Should not be treated as "malformed"...

Yeah, it's a bit subtle. I'd like to hear your opinions too.

dbuenzli commented 5 years ago

Yeah, it's a bit subtle. I'd like to hear your opinions too.

Well, you changed .utop-history from a regular UTF-8 encoded text file to a specific, obscure file format. I personally wouldn't do that.

kandu commented 5 years ago

It's still a regular utf-8 encoded text file. But now there is a bouncer and an auto-fixer to prevent malformed sequence(from the real world aspect).

dbuenzli commented 5 years ago

And then, all the statements stored in the .utop-history are well formed zed_strings.

Auto-fix only works with utop >= 2.4 when utop is running and reading user input. So a historical .utop-history file generated with old utop may still cause trouble.

I don't know but the above two sentences seem to indicate that this is not the case.

dbuenzli commented 5 years ago

Maybe I should have said arbitrary UTF-8 encoded file.

kandu commented 5 years ago

It's like the many undefined behaviors in the C language. A C compiler can do anything it chooses for undefined behavior. Even "to make demons fly out of your nose".

What we are talking about is definitely an undefined behavior in the Unicode standard. I, as a kind man, didn't blow up your computer at least. ;)

Arbitrary UTF-8 encoded file, it's always well formed from the Unicode standard aspect, which doesn't mean that it's not ridiculous, and can be represented on a real world screen.

For now, we haven't found and I think we will not find a just-right way to deal with such a ridiculous situation until the Unicode standard clarifies the issue we are encountering.

There are some choices, The action I described above(bouncer, auto-fixer) is one of them and is what I've chosen.

kandu commented 5 years ago

More specifically, what we'd most like to hear is: given the current situation, is there a better solution? What are the pros and cons of this solution? What will it break in terms of backward compatibility, and how? How will it affect downstream projects (especially zed, lambda-term and utop)? Besides that, other opinions are also welcome.

dbuenzli commented 5 years ago

What we are talking about is definitely an undefined behavior in the Unicode standard. I, as a kind man, didn't blow up your computer at least. ;)

Well you as a kind man made my utop non-functional for three days, made me lose my time trying to investigate a bogus stack trace until I eventually had the idea to actually rm ~/.top_history.

Whatever you do please do not render utop non-functional if there's a problem reading .top_history, that would be kind. Be robust to this.

Some people may edit .top_history thinking it's a plain text file with a text editor on which you will have no control over whatever your notion of "well-formed" is.

For now, we haven't found and I think we will not find a just-right way to deal with such a ridiculous situation until the Unicode standard clarifies the issue we are encountering.

That will likely never happen. Even if it's not, TUS still considers Unicode text to be stateless and will likely continue to do so ad aeternum. Text is always going to be arbitrary sequences of Unicode scalar values, leaving applications to do whatever they wish with it. You may get guidelines at a certain point, but the underlying data is always going to be that way.

Deal with it and be robust with it.

Besides that, other opinions are also welcome.

I don't think we need any opinions here. The user experience you are providing at the moment is absolutely terrible and it should be fixed.

kandu commented 5 years ago

Well you as a kind man made my utop non-functional for three days, made me lose my time trying to investigate a bogus stack trace until I eventually had the idea to actually rm ~/.top_history.

Sorry for that, but the bug (Invalid_argument("String.sub / Bytes.sub")) that made you unhappy was not introduced by me; it's a historical hidden bug.

Yeah, People are always too kind, too polite to say something true. I'm very glad to hear your honest and instructive words.

Unicode is just a standard trying to encode, represent and handle real world writing systems. If it's not up to the real world writing systems, that is the fault of the Unicode standard, not of the real world writing systems. I just expose that the current Unicode standard is bullshit in some situations. The current Unicode standard does introduce some stateful sequences; you can't just ignore that and say: "Unicode standard is my god, it's always stateless and we belong to it." What should be fixed is the Unicode standard if it can't represent and handle real world writing systems.

And if you want to express your feeling, your emotion. I hear you, and I'm very sorry for that.

Whatever you do please do not render utop non-functional if there's a problem reading .top_history, that would be kind. Be robust to this.

Some people may edit .top_history thinking it's a plain text file with a text editor on which you will have no control over whatever your notion of "well-formed" is.

Suggestion taken. I'll keep your instructive words in mind.

dbuenzli commented 5 years ago

What should be fixed is the Unicode standard if it can't represent and handle real world writing systems.

Well the reason why people are using it is that it mostly can, albeit not in the restrictive and clean way you would like. In any case I suggest utop should be fixed before you go on to fix the Unicode standard...

And if you want to express your feeling, your emotion. I hear you, and I'm very sorry for that.

I'm not here to express feelings or emotions. I'm here to report a bug that should be fixed so that myself and other users do not lose time with this (and this discussion is not helping w.r.t. the last point).

I have absolutely no idea why you can't simply apply whatever processing you are performing to get your clean zed strings on the ;; separated phrases of .utop_history. That doesn't look like rocket science, doesn't need the Unicode standard to be fixed and will make utop robust to whatever you'll find in .utop_history while allowing it to remain an arbitrary UTF-8 encoded text file.
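The suggestion above, processing each history phrase independently so that one bad entry cannot stop utop from starting, could look roughly like the sketch below. It is not utop's actual code: load_entries and sanitize are hypothetical names, sanitize stands in for whatever check or repair zed applies to an entry, and the entries are assumed to have already been split out of the history file.

```ocaml
(* [sanitize] is a hypothetical stand-in for the per-entry check or repair;
   it may raise on input it rejects. Entries that make it raise are dropped
   instead of aborting the whole load. Requires OCaml >= 4.08 for
   List.filter_map. *)
let load_entries ~(sanitize : string -> string) (raw : string list)
    : string list =
  List.filter_map
    (fun entry ->
      match sanitize entry with
      | clean -> Some clean
      (* Sketch only: real code would catch just the expected exception. *)
      | exception _ -> None)
    raw

(* Toy checker for the example: reject entries whose first byte is 0xE2,
   as in the entry that triggered the report. *)
let strict (s : string) : string =
  if s <> "" && s.[0] = '\xE2' then invalid_arg "bad start" else s

let () =
  load_entries ~sanitize:strict
    [ "1 + 1;;"; "\226\128\13918:;;"; "let x = 2;;" ]
  |> List.iter print_endline
```

With this shape the worst outcome of a damaged history file is a few missing entries, and the file itself can stay an arbitrary UTF-8 text file.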

pmetzger commented 5 years ago

@dbuenzli I'm sure patches from you to improve the code would be joyfully accepted.

kandu commented 5 years ago

I won't 'fix' it not only because it's a rare case that a guy will deliberately open an editor, insert some dangling combining marks at the very beginning position of a line.

There are some choices, The action I described above(bouncer, auto-fixer) is one of them and is what I've chosen.

It's not about 'fixing a bug', it's about 'making a choice', any of them has its own pros and cons. There is no 'the only', 'the right' solution in this situation. We can't force others to do what we think is right.

@dbuenzli as @pmetzger said, your patch would be joyfully accepted. I don't think that my choice is the right or the only one.

pmetzger commented 5 years ago

BTW, @kandu, I do think it would be better if the program was robust in the face of ugly Unicode junk in one of the files it is reading. Dying isn't the best result.

kandu commented 5 years ago

Whatever you do please do not render utop non-functional if there's a problem reading .top_history, that would be kind. Be robust to this.

Some people may edit .top_history thinking it's a plain text file with a text editor on which you will have no control over whatever your notion of "well-formed" is.

Suggestion taken. I'll keep your instructive words in mind.

https://github.com/ocaml-community/zed/pull/24 https://github.com/ocaml-community/lambda-term/pull/78

kandu commented 5 years ago

Whatever you do please do not render utop non-functional if there's a problem reading .top_history, that would be kind. Be robust to this.

These two PRs will resolve this issue. It seems I'll turn back being a kind man again ;)

I'm going on a field investigation and setting up an exhibition... it will take about two months. @pmetzger would you please help with this issue and these PRs, as I'll respond very infrequently from now on? The PRs were drafted in a hurry, so please review and test them before merging.

pmetzger commented 5 years ago

Sadly I don't understand Unicode processing logic very well, but maybe @dbuenzli would like to review those PRs? If he is comfortable I will merge them.

dbuenzli commented 5 years ago

Sorry I'm unfamiliar with what zed is doing or what kind of model it's trying to expose so I can't comment on that PR. But given @hannesm's output I suspect the lambda-term PR is ok.

kandu commented 5 years ago

This 'model' contains only three rules:

1. a zstring is a sequence of zchars
2. a zchar is NULL or Others(control character) or a grapheme
3. a grapheme consists of a normal printable character with optional subsequent combining characters

Rule 1 and rule 2 are de-facto standard. In fact, only rule 3 can be seen as a new rule.
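To make rule 3 concrete, here is a rough sketch of grouping scalar values into "base character plus following combining characters". It is not zed's implementation: group and is_combining are hypothetical names, and the predicate only covers U+0300..U+036F, whereas zed uses its own combining set.

```ocaml
(* Toy predicate, an assumption for illustration: only U+0300..U+036F
   (Combining Diacritical Marks) count as combining. *)
let is_combining (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  c >= 0x0300 && c <= 0x036F

(* Group scalar values into "base + following combining marks".
   Returns [Error i] if position [i] holds a combining mark with no base
   in front of it. [current] is the cluster being built, in reverse order. *)
let group (cps : Uchar.t list) : (Uchar.t list list, int) result =
  let flush current acc =
    match current with [] -> acc | c -> List.rev c :: acc
  in
  let rec go i current acc = function
    | [] -> Ok (List.rev (flush current acc))
    | u :: rest when is_combining u ->
      (match current with
       | [] -> Error i
       | c -> go (i + 1) (u :: c) acc rest)
    | u :: rest -> go (i + 1) [ u ] (flush current acc) rest
  in
  go 0 [] [] cps
```

Under this model "g" followed by U+0308 forms one group, while U+0308 followed by "g" is rejected at position 0, which has the same shape as the "at position 0: invalid start of Zed_char sequence" error in the original report.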

If there was no such a rule, we would face other awkward problems: How to render an individual combining character on a real world screen?

We abbreviate 'combining character' to 'cc' below.

1. Render the cc combined with a space. The awkward problems will be: what if the user copies this grapheme? He'll get a cooked grapheme, not a character. Not to mention that the width of this cc is zero, so it shouldn't occupy a cell.
2. Render the cc combined with the immediately subsequent character. The awkward problems will be: what we are seeing is not a true grapheme; it's actually a polluted character. That's confusing. Not to mention that it shouldn't be combined with the subsequent character according to the Unicode standard.
3. Ignore it. The awkward problems will be: ignore a glyph which does exist? What if we want to copy and paste a sentence that contains an individual cc?

That's why:

any of them has its own pros and cons. There is no 'the only', 'the right' solution in this situation.

I want to avoid these awkward problems from the very beginning. And the model which contains only one rule is fairly simple.

Still, I don't think that my choice is the right or the only one. We would like to hear more opinions.

If you think that my choice is acceptable, you can review the zed patch following that simple rule. And if you think that my choice is stupid, then it couldn't be better: your patch is highly anticipated.

kandu commented 5 years ago

And the target rendering environment for zed/lambda-term is more restrictive compared to a GUI environment. The rendering screen a terminal or terminal simulator provides for us is literally made up of fixed size cells and the only element we can use to represent a character is character itself.

Ah, I suddenly have a new idea, what about we teach the users a set of rules. We write a book, called <escape characters you should know while using zed/lambda-term based programs> All the awkward problems will go away then ^_^

dbuenzli commented 5 years ago

I'm sorry but I don't have the time to delve into this problem. But here are a few comments:

1. I don't see why you are inventing your own notion of grapheme cluster. Unicode has one, it may not be perfect and won't exactly match what the various tty will think of a user perceived character, but that should be a good starting point for text editing "character" cursor movements on arbitrary Unicode text --- it has for example well-defined answers on what you see as "degenerate" text [1].
2. Given that you are in a tty setting and cannot get feedback from the rendering layer, what you will be able to do will remain a best-effort endeavor anyway, see here for a discussion.
3. I have the feeling you are trying to solve what are rendering problems at the text editing layer by artificially constraining what kind of text you are able to edit. It seems unwise to do so if you want your library to be able to edit arbitrary UTF-{8,16} encoded text files. You may want to massage the data before handing it out to the tty (but it's not even clear that is a good idea), but I don't think you should massage it before being able to edit it.
4. You are certainly not the first person to run into these problems. You might want to have a look into the sources and behaviour of various text editors to see how they deal with these matters.

[1]:

# List.rev (Uuseg_string.fold_utf_8 `Grapheme_cluster (fun acc seg -> seg :: acc) [] "\u{0308}\u{0067}");;
- : string list = ["̈"; "g"]

dbuenzli commented 5 years ago

3. You may want to massage the data before handing it out to the tty (but it's not even clear that is a good idea),

Except, of course, for US-ASCII control characters which should be escaped.

pqwy commented 5 years ago

More specifically, what we'd most like to hear is: given the current situation, is there a better solution? What are the pros and cons of this solution? What will it break in terms of backward compatibility, and how? How will it affect downstream projects (especially zed, lambda-term and utop)? Besides that, other opinions are also welcome.

I don't understand what is even being fixed here.

If you dump arbitrary Unicode onto a modern terminal - probably as UTF-8 - it will always be rendered as something. This includes the sequences currently being rejected by zed. In this sense, there can be no "malformed" Unicode sequences at all: all representable scalar value sequences will be drawn by a terminal.

The task of a terminal library is simply to figure out what this rendering will look like, in order to predict the positioning.

Assuming this was the real problem being solved here, you could take a look at what notty does:

Rendering units are Unicode grapheme clusters, as detected by the standard segmentation algorithm.
The "width" of a scalar value is approximated by a variant of wcwidth.
The actual width of a rendered grapheme cluster is the sum of wcwidths of its scalar values.

It's debatable what the meaning of an isolated combining character, or a selector, would be, but the terminal will definitely do something, and this procedure tends to predict that something rather well.

It also correctly predicts the output for many real writing systems I've tried, including traditional and simplified Chinese, Hiragana, Katakana, Hangul in all of its normalization forms, abjads like Hebrew and Arabic, most scripts from India, and of course, anything remotely related to European writing -- and all of that on a wide range of terminals. It's notably a little incorrect on Kannada, but only because of that particular wcwidth.

And it retains the property of being robust in the face of what you consider malformed sequences.
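The width rule described above (the width of a cluster is the sum of the widths of its scalar values) can be sketched as follows. This is not notty's code: approx_width is a deliberately crude, made-up stand-in for wcwidth that covers only a few ranges, cluster_width is a hypothetical name, and the input is assumed to be a single grapheme cluster that has already been segmented (for example with Uuseg). It needs OCaml 4.14 or later for String.get_utf_8_uchar.

```ocaml
(* Crude stand-in for wcwidth, an assumption for illustration only:
   0 for a few zero-width ranges, 2 for some CJK ranges, 1 otherwise. *)
let approx_width (u : Uchar.t) : int =
  let c = Uchar.to_int u in
  if (c >= 0x0300 && c <= 0x036F) || (c >= 0x200B && c <= 0x200F) then 0
  else if (c >= 0x3040 && c <= 0x30FF) || (c >= 0x4E00 && c <= 0x9FFF) then 2
  else 1

(* Width of one grapheme cluster, given as UTF-8: the sum of the widths
   of its scalar values. *)
let cluster_width (cluster : string) : int =
  let rec go i acc =
    if i >= String.length cluster then acc
    else
      let d = String.get_utf_8_uchar cluster i in
      go
        (i + Uchar.utf_decode_length d)
        (acc + approx_width (Uchar.utf_decode_uchar d))
  in
  go 0 0
```

An isolated combining mark simply gets width 0 here, so nothing has to be rejected in order to compute a cursor position.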

XVilka commented 5 years ago

@pqwy, note there is also a BiDi problem:

Moreover, there is ligature problem, which would be nice to support at some point.

pqwy commented 5 years ago

@XVilka So what are you proposing, exactly?

BiDi, and especially ligatures, have nothing to do with the "malformed sequences" invented by zed.

The procedure above has nothing to do with directionality.

It does have something to do with ligatures, in that it breaks spectacularly when ligatures are drawn as such. But with the current software stack this is purely a property of the rendering layer, and impossible to predict from a text-based program. It does match the behaviour of most terminals you can run, in that they do not draw ligatures.

pmetzger commented 5 years ago

If people want to submit patches to improve the current behavior of the code, they're encouraged to do so.

pqwy commented 5 years ago

@pmetzger Reverting the invention of "malformed sequences", that no other Unicode-processing software has a conception of, let alone the standard, is an easy one. I think.

pmetzger commented 5 years ago

As I said, patches are always welcome!

kandu commented 5 years ago

Thank you @dbuenzli

The information you provided is very helpful. I didn't know grapheme cluster and its segmentation algorithm. What I've been considering are variant selectors and combining characters. So, what had already been implemented is a subset of utilities to process grapheme cluster. And now, I think we can quicken the progress of the proposal made by @nojb, that is, to introduce uucd/uucp into zed and lambda-term. Then we can improve the internal logic in zed/lambda-term step by step steadily.

And I've also taken a look at several text editors (kate, kwrite, gedit, vim, emacs); only gedit works semi-properly when combining characters occur. It uses another character to help represent an isolated combining mark, which is confusing in only one case: it uses '◌' instead of a space to serve as an indicator-holder. '◌' is rare compared to space, which is spread all over the English writing system.

As for vim, a tab with a ̂ will destroy both the cursor position and the text representation. emacs deals with this a little better. It does what I've said:

Render the cc combined with a space. The awkward problems will be: what if the user copies this grapheme? He'll get a cooked grapheme, not a character. Not to mention that the width of this cc is zero, so it shouldn't occupy a cell.

kandu commented 5 years ago

Reverting the invention of "malformed sequences", that no other Unicode-processing software has a conception of, let alone the standard, is an easy one. I think.

Yes, it's easy.

I don't care about the easy one. Because to achieve it, we only have to delete some "unnecessary stupid" code I've committed. That will be easy and elegant work.

What I care about more is the combination of editor engine and terminal manipulating library. How to resolve all the awkward problems I've mentioned. For now, no editor except gedit which works in GUI environment semi-resolved it.

If you dump arbitrary Unicode onto a modern terminal - probably as UTF-8 - it will always be rendered as something.

hmm, as an editor engine, zed provides cursor management. Knowing there is something on the screen/terminal emulator is not enough. As for cursor management, put all unicode sequence on the screen and "guess" where the cursor is?

kandu commented 5 years ago

I have the feeling you are trying to solve what are rendering problems at the text editing layer by artificially constraining what kind of text you are able to edit. It seems unwise to do so if you want your library to be able to edit arbitrary UTF-{8,16} encoded text files. You may want to massage the data before handing it out to the tty (but it's not even clear that is a good idea), but I don't think you should massage it before being able to edit it.

Yes, exactly as what you said.

But if I made a wise decision (and it's fairly easy and elegant), then all the downstream projects would be unwise, because all the awkward problems would remain in their projects.

kandu commented 5 years ago

But if I made a wise decision (and it's fairly easy and elegant), then all the downstream projects would be unwise, because all the awkward problems would remain in their projects.

And to do so, we would have to drop cursor management in zed too, since it's strongly tied to the representation layer. Allowing character sequences whose representation is unpredictable will lead to inaccuracy.

ocaml-community / utop

utop does not start #288