moonbitlang / core

MoonBit's Core library
https://moonbitlang.com/
Apache License 2.0

utf-8 <=> utf-16 conversion needs to be built-in #484

Open gmlewis opened 6 months ago

gmlewis commented 6 months ago

Currently there is no easy way (that I can find) to convert back and forth between UTF-8- and UTF-16-encoded strings.

I'm doing this as a workaround: https://github.com/gmlewis/moonbit-pdk/blob/master/pdk/string.mbt and this: https://github.com/gmlewis/moonbit-pdk/blob/a9777b8b71ff1cf2e77a3cdf95244197d24343fd/pdk/host.mbt#L20-L30 but would like to replace these with standard library calls.

gmlewis commented 6 months ago

Anyone who is designing a new programming language should watch this: https://www.youtube.com/watch?v=Ri2NMnSQo4o and ideally avoid UTF-16 entirely. Rust and Go both use UTF-8 for a good reason. If UTF-16 can't be avoided, then maybe a new StringUtf8 type would be nice to have so that users can avoid String as much as possible and only use the UTF-8 variant.

KKKIIO commented 5 months ago

It would be beneficial to include an API that facilitates encoding strings in UTF-8 and writing them to a Buffer, essentially adding Buffer::write_string_utf8. This enhancement should be straightforward to implement leveraging the @string.String::as_iter function, which returns an Iter[Char]. However, it appears that Buffer is defined within the builtin package, which currently limits its use of @string.String::as_iter.
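For illustration only, here is a minimal Python sketch of what a `write_string_utf8`-style helper might do: walk the string one code point at a time and append each code point's UTF-8 bytes to a growable buffer. The function names and the use of `bytearray` as the buffer are assumptions made for this sketch; this is not MoonBit code and not an existing `Buffer` API.

```python
def encode_code_point_utf8(cp):
    """Encode one Unicode scalar value as UTF-8 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def write_string_utf8(buf, text):
    """Append the UTF-8 encoding of `text` to the bytearray `buf`.

    Iterating a Python str yields code points, which plays the role of
    the Iter[Char] mentioned above.
    """
    for ch in text:
        buf.extend(encode_code_point_utf8(ord(ch)))


buf = bytearray()
write_string_utf8(buf, "héllo €")
```

Because the encoder works per code point, the same loop would apply to any string type that can iterate over characters rather than raw UTF-16 code units.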

gmlewis commented 3 months ago

In the meantime, @peter-jerry-ye pointed me to an encoder and decoder here: https://github.com/peter-jerry-ye/jstream Thank you, @peter-jerry-ye!

Lampese commented 1 month ago

> Anyone who is designing a new programming language should watch this: https://www.youtube.com/watch?v=Ri2NMnSQo4o and ideally avoid UTF-16 entirely. Rust and Go both use UTF-8 for a good reason. If UTF-16 can't be avoided, then maybe a new StringUtf8 type would be nice to have so that users can avoid String as much as possible and only use the UTF-8 variant.

I think supporting UTF-16 is necessary. In fact, MoonBit originally used UTF-8 but later switched to UTF-16. We have two important backends (Wasm and JavaScript), and both Wasm's string proposal (including its earlier integration with JavaScript) and JavaScript's String are based on UTF-16, which is why we use UTF-16.

That said, I fully support adding UTF-8 <=> UTF-16 conversion methods.
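As a rough sketch of the conversion being requested, the Python below converts between a sequence of UTF-16 code units and UTF-8 bytes, handling surrogate pairs explicitly. The function names are hypothetical, UTF-16 strings are modelled as plain lists of integers, and unpaired surrogates are rejected; a real standard-library design might choose a different error policy.

```python
def utf16_units_to_utf8(units):
    """Convert a list of UTF-16 code units to UTF-8 bytes."""
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("unpaired high surrogate at index %d" % i)
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            code_points.append(u)
            i += 1
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


def utf8_to_utf16_units(data):
    """Convert UTF-8 bytes to a list of UTF-16 code units."""
    units = []
    for ch in data.decode("utf-8"):  # raises UnicodeDecodeError on bad input
        cp = ord(ch)
        if cp >= 0x10000:
            # Split a supplementary-plane code point into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        else:
            units.append(cp)
    return units


# "a" followed by U+1F600, which needs a surrogate pair in UTF-16.
units = [0x61, 0xD83D, 0xDE00]
utf8 = utf16_units_to_utf8(units)
assert utf8_to_utf16_units(utf8) == units
```

The round trip at the end shows that a character outside the Basic Multilingual Plane occupies two UTF-16 code units but four UTF-8 bytes, which is the case a built-in conversion would need to get right.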