rakitzis / rc

rc shell -- independent re-implementation for Unix of the Plan 9 shell (from circa 1992)

No UTF-8? #49

Closed ghost closed 1 year ago

ghost commented 6 years ago

Seems like this rc is really bad at everything other than ASCII characters in interactive mode; it enters some weird state when I start typing in another language. (image) Is it even fixable? rc from Plan 9 works correctly in a tty; in other terminals it thinks that non-standard characters are double-size, and so it clears them incorrectly with backspace.

TobyGoodwin commented 6 years ago

I doubt this is rc's fault. It's almost entirely ignorant of character encodings, but since it largely slings around uninterpreted bytestrings it gets away with it. (I'm a bit surprised that ? globs seem to work correctly in the presence of multibyte characters as that's one place where I would expect to need minimal UTF-8 support.)

First thing to check is that your locale settings are correct. What is $LANG ?

ghost commented 6 years ago

Yeah, seems like my locale.conf wasn't being read by rc, so I had to add LANG to .rcrc manually.

ghost commented 6 years ago

Another problem: (image)

TobyGoodwin commented 6 years ago

Ah yes, thank you! In this case, the command name is being deliberately scrambled by protect() in which.c. It wants to avoid non-printing characters, but uses the ASCII-only isprint(). I reckon I can fix that to handle UTF-8 fairly easily (without needing to drag in libicu, for instance). I then worry that we're being UTF-8-centric, and what about the UTF-16 and UTF-32 encodings? I simply don't have enough experience to tackle those sensibly. If anyone does, do send a Pull Request!
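For illustration, a UTF-8-aware stand-in for the isprint() check might look roughly like the sketch below. This is not the code in which.c: the names utf8_seq_len and protect_utf8 are hypothetical, and the policy (replace control characters and malformed bytes with '?', pass well-formed multibyte sequences through untouched) is an assumption about what such a fix would want.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the well-formed UTF-8 sequence at s (n bytes available),
 * or 0 if the bytes there are not well-formed UTF-8. */
static size_t utf8_seq_len(const unsigned char *s, size_t n) {
	size_t len, i;

	if (n == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;
	else if (s[0] >= 0xC2 && s[0] <= 0xDF)
		len = 2;
	else if (s[0] >= 0xE0 && s[0] <= 0xEF)
		len = 3;
	else if (s[0] >= 0xF0 && s[0] <= 0xF4)
		len = 4;
	else
		return 0;	/* stray continuation byte or invalid lead byte */
	if (len > n)
		return 0;	/* truncated sequence */
	for (i = 1; i < len; i++)
		if ((s[i] & 0xC0) != 0x80)
			return 0;
	/* Reject overlong forms, surrogates, and code points above U+10FFFF. */
	if ((s[0] == 0xE0 && s[1] < 0xA0) || (s[0] == 0xED && s[1] > 0x9F) ||
	    (s[0] == 0xF0 && s[1] < 0x90) || (s[0] == 0xF4 && s[1] > 0x8F))
		return 0;
	return len;
}

/* Hypothetical helper: rewrite str in place, replacing each byte of a
 * control character (C0, DEL, C1) or of malformed input with '?'. */
void protect_utf8(char *str) {
	unsigned char *s = (unsigned char *)str;
	size_t n = strlen(str), i = 0;

	while (i < n) {
		size_t len = utf8_seq_len(s + i, n - i);

		if (len == 0) {
			s[i++] = '?';			/* malformed byte */
		} else if (len == 1) {
			if (s[i] < 0x20 || s[i] == 0x7F)
				s[i] = '?';		/* C0 control or DEL */
			i++;
		} else if (len == 2 && s[i] == 0xC2 && s[i + 1] < 0xA0) {
			s[i] = s[i + 1] = '?';		/* C1 control, U+0080..U+009F */
			i += 2;
		} else {
			i += len;			/* well-formed multibyte: keep */
		}
	}
}
```

A check like this only knows about UTF-8, so it does nothing for the UTF-16 and UTF-32 question raised above.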

ghost commented 6 years ago

I'm not sure how this works, but if possible I would like to avoid having anything higher than UTF-8, especially since you don't have enough experience. Most of the time the simpler solution is more robust.

rakitzis commented 6 years ago

Just saw this thread.. wonder why bother with a protect() at all? This is the simplest solution, and it's in line with rc punting on all UTF-8 issues (for now).

TobyGoodwin commented 6 years ago

I think I wrote protect() when I was much younger. If so, it was in response to some hostile environment or other (might well have been a Windows 3.1 terminal emulation + telnet) and It Seemed Like A Good Idea At The Time.

xyb3rt commented 1 year ago

Should we get rid of protect(), @rakitzis?

rakitzis commented 1 year ago

It's not useful as it stands. Please remove it.

xyb3rt commented 1 year ago

Looking into this, I saw that env -i rc behaves the same as env -i sh when built with EDIT=null on my system. I only got the behaviour from the original comment when building with EDIT=readline. It might be worth looking into other interactive programs that use readline (e.g. python) to see how they behave and how they get it right.

xyb3rt commented 1 year ago

Python gets it right. This is what they're doing: https://peps.python.org/pep-0538/

rakitzis commented 1 year ago

OK does this boil down to something simple that can be done for rc?

xyb3rt commented 1 year ago

On Linux it basically boils down to overriding LC_CTYPE with C.UTF-8 at startup if it is C or POSIX. Unfortunately, on other systems the value needs to be slightly different.
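A minimal sketch of what that startup step could look like, loosely following PEP 538. Nothing here is rc code: is_set and coerce_c_locale are hypothetical names, the list of candidate locale names is the one PEP 538 uses, and leaving an explicit LC_ALL alone is also taken from the PEP.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for setenv() */
#endif
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if the environment variable is set to a non-empty value. */
static int is_set(const char *name) {
	const char *v = getenv(name);
	return v != NULL && *v != '\0';
}

/* Hypothetical startup hook: if the environment asks for the C/POSIX
 * locale for LC_CTYPE, switch to a UTF-8 variant of it when one exists.
 * Returns 1 if LC_CTYPE was coerced, 0 if nothing was changed. */
int coerce_c_locale(void) {
	/* Candidate names from PEP 538; which one exists varies by system. */
	static const char *const candidates[] = { "C.UTF-8", "C.utf8", "UTF-8" };
	const char *cur;
	size_t i;

	/* LC_ALL overrides LC_CTYPE, so an explicit LC_ALL is left alone. */
	if (is_set("LC_ALL"))
		return 0;
	cur = is_set("LC_CTYPE") ? getenv("LC_CTYPE")
	    : is_set("LANG") ? getenv("LANG")
	    : "C";		/* nothing set means the POSIX locale */
	if (strcmp(cur, "C") != 0 && strcmp(cur, "POSIX") != 0)
		return 0;
	for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
		if (setlocale(LC_CTYPE, candidates[i]) != NULL)
			/* Export it so line editors and child processes agree. */
			return setenv("LC_CTYPE", candidates[i], 1) == 0;
	}
	return 0;		/* no UTF-8 flavour of C available */
}
```

Such a hook would have to run before readline is initialised. It returns 0 on systems that ship none of the candidate locales, which is the portability wrinkle mentioned above.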

I'll vote for doing nothing, because we can argue that an LC_CTYPE of C or POSIX asks for this behaviour.

xyb3rt commented 1 year ago

I'm closing this, because rc is encoding-agnostic and works correctly with a properly configured locale.