squeak-smalltalk / squeak-object-memory

Issues and assets related to the Squeak object memory.
https://bugs.squeak.org
MIT License

Proposal to use UTF32InputInterpreter (no leadingChar) for Unicode platforms in Squeak 6.0 #18

Open dram opened 2 years ago

dram commented 2 years ago

Abstract

Currently, in Squeak 6.0 alpha, each language environment uses its own input interpreter to handle keyboard input, in order to add a leadingChar to each input character. For Unicode platforms, these are UTF32CNInputInterpreter, UTF32GreekInputInterpreter (unused), UTF32JPInputInterpreter, UTF32NPInputInterpreter and UTF32RussianInputInterpreter.

However, the leadingChar mechanism on Unicode platforms in Squeak 6.0 alpha is quite incomplete, and it causes more problems than it solves. This proposal suggests using UTF32InputInterpreter on Unicode platforms, which does not introduce leadingChars into the system.

Rationale

For non-Unicode platforms, Squeak has a full set of text converters and input and clipboard interpreters for different languages, covering keyboard input, file system access, clipboard access, and file reading and writing. Those text converters add a leadingChar to every character in order to cope with the problems caused by Han unification.

For Unicode platforms, only the input interpreters have been completely implemented for different languages: there are no language variants of UTF8TextConverter, and the variants of UTF8ClipboardInterpreter do not implement the leadingChar mechanism.

This leaves Squeak 6.0 alpha in a frustrating situation: text from the keyboard, the clipboard, files and the network is inconsistent with regard to leadingChar, so strings that look identical compare as non-equal. The problems caused by this inconsistency are well hidden and difficult to detect.
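
To make the inconsistency concrete, here is a minimal sketch in Python (a model, not Squeak code). It assumes a character value stores the leadingChar in the bits above a 22-bit code point; the tag value 6 is hypothetical and only stands in for "some language environment".

```python
CODE_BITS = 22                      # assumed width of the code point field
CODE_MASK = (1 << CODE_BITS) - 1    # mask that keeps only the code point


def tag(text, leading_char):
    """Model each character as (leading_char << CODE_BITS) | code point."""
    return [(leading_char << CODE_BITS) | ord(ch) for ch in text]


def strip(values):
    """Drop the leadingChar bits and keep the bare code points."""
    return [v & CODE_MASK for v in values]


# The same two characters arriving through two different paths.
typed = tag("自由", 6)      # keyboard path: tagged (6 is a hypothetical value)
decoded = tag("自由", 0)    # file or network path: no leadingChar added

same_text_differs = typed != decoded                  # tagged and untagged values differ
equal_after_strip = strip(typed) == strip(decoded)    # the code points agree
```

Both strings would render the same glyphs, yet they only compare equal once the tag bits are masked off, which is why the mismatch is easy to miss.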

Implementation

The implementation is relatively simple; a patch (Multilingual-xw.285) has been submitted to The Inbox.

Future Works

Regarding the future of leadingChar support for Unicode platforms in Squeak, I can think of the following possibilities:

  1. Add ubiquitous leadingChar support on Unicode platforms for all language environments.
  2. Remove leadingChar from Squeak gradually, for both Unicode and non-Unicode platforms.
  3. Retain leadingChar as a mechanism for tagging text, but require users to specify it explicitly (e.g. with the help of the UI).

But this is out of scope for this proposal and should be discussed separately.

marceltaeumel commented 2 years ago

However, the leadingChar mechanism on Unicode platforms in Squeak 6.0 alpha is quite incomplete, and it causes more problems than it solves.

Why so? Yes, I want to remove the leadingChar mechanism but not for Squeak 6.0. For now, I would rather fix those clipboard and input interpreters.

dram commented 2 years ago

I think that fixing only the clipboard interpreters is not enough to make the following code work:

(UTF8TextConverter decodeByteString: (WebClient httpGet: 'https://squeak.org/documentation/') content) findString: '自由自在'

Some kind of UTF8TextConverter variant would need to be introduced, such as UTF8CNTextConverter, to add the Chinese leadingChar when decoding UTF-8 text.

But how would one search for Japanese characters in a Chinese environment? The code would turn out quite strange, as a Chinese decoder is used with Japanese text:

(UTF8CNTextConverter decodeByteString: (WebClient httpGet: 'https://squeak.org/documentation/') content) findString: 'プログラミング'

Another approach would be to let UTF8TextConverter add a leadingChar that depends on the currently running language environment. But then code would not be portable between different environments: e.g. if the first sample is written in a Chinese environment and works properly there, it will not work in a Japanese environment.
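
A rough Python model of that portability problem (again not Squeak code): the tag values 5 and 6, the 22-bit layout and the helper names are all assumptions made up for illustration, and unlike Squeak's findString: the helper here is 0-based and returns -1 on a miss.

```python
CODE_BITS = 22                       # assumed position of the leadingChar bits
LEADING_CHAR = {"zh": 6, "ja": 5}    # hypothetical per-environment tag values


def tag(text, env):
    """Tag every character with the leadingChar of the given environment."""
    return tuple((LEADING_CHAR[env] << CODE_BITS) | ord(ch) for ch in text)


def decode_utf8(data, env):
    """Model of a converter whose output depends on the running environment."""
    return tag(data.decode("utf-8"), env)


def find(haystack, needle):
    """Return the 0-based index of needle in haystack, or -1 if absent."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


page = "自由自在 in Squeak".encode("utf-8")   # bytes as fetched from the network
needle = tag("自由自在", "zh")                # literal written in a Chinese environment

hit_in_zh = find(decode_utf8(page, "zh"), needle)   # tags match, search succeeds
hit_in_ja = find(decode_utf8(page, "ja"), needle)   # tags differ, search fails
```

The bytes and the search string are identical in both runs; only the running environment changes, and that alone decides whether the search succeeds.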

BTW, what I propose is disabling leadingChar in Squeak 6.0, not removing it. To take it a bit further, we could add a preference so that people who care about leadingChar can enable it explicitly. This would also pave the way for removing it in future Squeak versions.

itsmeront commented 2 years ago

Hi Xin Wang 先生

Thank you for your work on Unicode! We use Squeak to support virtual spaces which allow people from around the world to connect and collaborate. It is my dream to be able to eventually support any language in Squeak including typing, copying and pasting, without having to set the language environment for the shared space or having to pick a default font. Much like in Gmail, it would be great if I could just copy and paste characters and have them work everywhere. We have people who join one virtual space from many different places, using many different keyboards and local language settings. I recently had to add support for German keyboards. Setting and supporting local hardware makes sense, but it gets complicated when people with different language settings join in a single virtual experience. Is this dream achievable in Squeak? How does removing the leading character change the ability to support multiple languages at the same time?

Thank you! I've enjoyed seeing you working on this and I wish you well!

All the best,

Ron Teitelbaum
Chief Executive Officer, 3D Immersive Collaboration Corp
www.3dicc.com


dram commented 2 years ago

Hi @itsmeront,

I had a quick look at the 3DICC website, quite impressive!

You mentioned several topics in your message; I'll try to reply to the ones I'm familiar with.

  1. "support any language in Squeak including typing, copying and pasting, without having to set the language environment for the shared space or having to pick a default font"

In my recent experience, multi-language support in Squeak 6.0 alpha is relatively mature, at least for Chinese. "Typing, copying and pasting" works and the "language environment" is set automatically. The "default font" currently needs to be selected manually, but I think it could be chosen programmatically based on the language and operating system, using the OS's default font as Squeak's "fallback font".

  2. "I recently had to add support for German keyboards"

I'm not familiar with German or other local keyboards, but I think there are three levels of abstraction in hardware support: the operating system, the Squeak VM and the Squeak image. At the image level, keyboard input is handled as an event carrying a Unicode keycode, which is unaware of the differences between hardware. At the VM level I'm not sure, as I have not read its code, but I think it calls some API of the underlying operating system to get data from the keyboard or IME.

  3. "How does removing the leading character change the ability to support multiple languages at the same time"

leadingChar is Squeak's attempt to solve the Han unification problem of Unicode. But I think that for Squeak most of this problem can be solved by choosing fonts based on the default font of the underlying operating system, so that native text is displayed correctly to native users. If communication between different languages is needed and users care about the Han unification problem, then some kind of tagging needs to be introduced, like the lang attribute in HTML. But I think that kind of mechanism is better implemented at the application level, not at the character encoding level.
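
As a rough illustration of what application-level tagging could look like, here is a small Python sketch (not Squeak code). The Run type, the font table and the helper names are hypothetical; the idea is only that the language travels next to the text, the way a lang attribute does in HTML, while the characters themselves stay plain Unicode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A stretch of plain Unicode text plus a language tag kept beside it."""
    text: str
    lang: str   # e.g. "zh-Hans" or "ja", in the style of HTML's lang attribute


# Hypothetical font table; a real application would consult the OS defaults.
FONT_BY_LANG = {"zh-Hans": "Noto Sans CJK SC", "ja": "Noto Sans CJK JP"}
FALLBACK_FONT = "system default"


def font_for(run):
    """Choose a display font from the run's language tag."""
    return FONT_BY_LANG.get(run.lang, FALLBACK_FONT)


def plain_text(runs):
    """Concatenate the untagged text, e.g. for searching or comparing."""
    return "".join(run.text for run in runs)


# The same unified Han character, once as Chinese and once as Japanese.
doc = [Run("直", "zh-Hans"), Run("直", "ja")]

fonts = [font_for(run) for run in doc]      # rendering differs per language
same_char = doc[0].text == doc[1].text      # the characters still compare equal
found_at = plain_text(doc).find("直直")     # searching ignores the language tag
```

With this split, display can still respect the reader's language, while comparison and searching work on ordinary code points.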

itsmeront commented 2 years ago

Hello Xin Wang 先生,

I'm sorry, I had not followed the rationale for the changes. I read up on the issue and understand now. Yes, what we need is probably language tagging and automatic font changes to apply. We are working with the i18n framework that Yoshiki-san and Masashi-san developed for us. That is one-half of a very good solution. Thank you for your suggestions to add this support as part of our application. That makes sense. I'm sure your work on Unicode will also help our users a great deal. Thank you!

All the best,

Ron Teitelbaum
