tc39 / ecma402

Status, process, and documents for ECMA 402
https://tc39.es/ecma402/

Unicode Properties #90

Open srl295 opened 8 years ago

srl295 commented 8 years ago

https://github.com/srl295/es-unicode-properties

sffc commented 4 years ago

~Please move further discussion to the proposal repo.~

https://github.com/srl295/es-unicode-properties/issues

EDIT (May 2021): This proposal is currently stalled, pending more concrete use cases.

my2iu commented 3 years ago

Can you make the Unicode properties needed for internationalization and low-level text rendering available? It’s becoming increasingly common to do low-level text rendering in JavaScript because certain APIs like WebGL require it or because people are making more complex web apps like word processors, paint programs, or graphic design tools. Implementing internationalization support for this low-level rendering like the bidi algorithm, vertical orientation, and text shaping requires a lot of these Unicode properties, so it would be great if there were an API making them available. Right now, libraries like Harfbuzzjs simply include a compressed version of the Unicode database in their code, and it’s not too big, but since web browsers already know this information, it would be great if web browsers made it available to JavaScript. Preferably these properties would be fast to access too.
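As a rough illustration of the kind of table a library has to bundle today, here is a minimal sketch of a range-based property lookup. It is not how harfbuzzjs actually stores its data; the names `BIDI_RANGES` and `bidiClass` are hypothetical, and the table is a tiny hand-picked subset of Bidi_Class values rather than the full Unicode Character Database.

```javascript
// Illustrative subset only: [start, end, Bidi_Class], sorted by start.
// A real table would be generated from the Unicode Character Database.
const BIDI_RANGES = [
  [0x0030, 0x0039, "EN"], // ASCII digits
  [0x0041, 0x005a, "L"],  // Latin capital letters A-Z
  [0x0061, 0x007a, "L"],  // Latin small letters a-z
  [0x05d0, 0x05ea, "R"],  // Hebrew letters alef..tav
  [0x0621, 0x063a, "AL"], // Arabic letters hamza..ghain
];

// Hypothetical lookup: binary search over the sorted ranges.
// Returns undefined for code points this toy table does not cover.
function bidiClass(codePoint) {
  let lo = 0;
  let hi = BIDI_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, cls] = BIDI_RANGES[mid];
    if (codePoint < start) {
      hi = mid - 1;
    } else if (codePoint > end) {
      lo = mid + 1;
    } else {
      return cls;
    }
  }
  return undefined;
}
```

A built-in API would let the browser answer the same question from data it already ships, instead of each library downloading its own copy.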

ryzokuken commented 3 years ago

@my2iu thanks for your comment. It looks like the proposal has not seen much activity lately, but hopefully that will change soon...

srl295 commented 3 years ago

@my2iu @ryzokuken that's why I proposed this, but there was a lot of pushback that there weren't real use cases for anything that couldn't be covered by regex. See https://github.com/srl295/es-unicode-properties

my2iu commented 3 years ago

Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

srl295 commented 3 years ago

> Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

Not sure why that would be slow, it's mostly the same strings.

Can you make a concrete list of regex properties that are currently missing? I.e. specific properties from the Unicode spec you need?

Also see https://github.com/srl295/es-unicode-properties#why-not-just-use-regex
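For context, a minimal sketch of what regex property escapes cover today and how to probe for a gap. To my knowledge, ECMAScript's `\p{...}` accepts General_Category, Script, Script_Extensions, and a fixed list of binary properties, so an enumerated property such as Bidi_Class is rejected as a syntax error. The helper name `regexSupports` is hypothetical.

```javascript
// Property escapes need the `u` (or `v`) flag.
const isGreek = /^\p{Script=Greek}$/u;
const isDecimalDigit = /^\p{General_Category=Decimal_Number}$/u;
const isHebrew = /^\p{Script=Hebrew}$/u;

// Hypothetical probe: does this engine accept the pattern in Unicode mode?
function regexSupports(pattern) {
  try {
    new RegExp(pattern, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

console.log(isGreek.test("\u03A9"));             // GREEK CAPITAL LETTER OMEGA
console.log(isDecimalDigit.test("\u0663"));      // ARABIC-INDIC DIGIT THREE
console.log(regexSupports("\\p{Script=Greek}")); // supported property
console.log(regexSupports("\\p{Bidi_Class=R}")); // not a supported property name
```

A probe like this could be one way to build the concrete list of missing properties: run it over the property names a renderer needs and record which ones are rejected.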

my2iu commented 3 years ago

Sure. Now that I think about it, something that operates on code points might be the fastest, combined with an API that either exposes a lot of methods or provides a “matcher” object that can be optimized the way regexes are. Actually, does JS even have proper support for code points and surrogates yet? The last time I checked, people were still arguing whether JS strings were UCS-2, UTF-16, or UTF-32.

srl295 commented 3 years ago

> operates on code points might be the fastest

It could be an overload. The getter could take either a string or an integer code point. This is discussed in https://github.com/srl295/es-unicode-properties/issues/5
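A minimal sketch of what such an overload might look like, under the assumption that a string argument must contain exactly one code point. Both `toCodePoint` and `isDecimalNumber` are hypothetical names, not part of the proposal, and the getter is backed by a regex property escape only so that the example runs today.

```javascript
// Hypothetical helper: normalize either input form to an integer code point.
function toCodePoint(input) {
  if (typeof input === "number") {
    if (!Number.isInteger(input) || input < 0 || input > 0x10ffff) {
      throw new RangeError("code point out of range");
    }
    return input;
  }
  if (typeof input === "string") {
    const cp = input.codePointAt(0);
    // Reject the empty string and strings longer than one code point.
    if (cp === undefined || String.fromCodePoint(cp).length !== input.length) {
      throw new TypeError("expected exactly one code point");
    }
    return cp;
  }
  throw new TypeError("expected a string or an integer code point");
}

// Hypothetical overloaded getter: accepts "7" or 0x37.
function isDecimalNumber(input) {
  const ch = String.fromCodePoint(toCodePoint(input));
  return /^\p{Nd}$/u.test(ch);
}
```

With this shape, `isDecimalNumber("7")` and `isDecimalNumber(0x37)` take the same path after normalization, so callers that already hold integers avoid building strings in their own code.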

> does JS even have proper support for code points and surrogates yet?

yes.
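A short sketch of the standard pieces involved: JS strings are sequences of UTF-16 code units, and the language provides code point access on top of that through `codePointAt`, `String.fromCodePoint`, and string iteration. The variable names are only for illustration.

```javascript
// U+1F600 lies outside the BMP, so it occupies two UTF-16 code units.
const s = "a\u{1F600}b";

const codeUnits = s.length;           // counts UTF-16 code units
const codePoints = [...s].length;     // string iteration walks code points
const astral = s.codePointAt(1);      // full code point starting at index 1
const highSurrogate = s.charCodeAt(1); // first code unit of the pair
const lowSurrogate = s.charCodeAt(2);  // second code unit of the pair

// Collect code points as integers, e.g. to feed an integer-based API.
const asIntegers = Array.from(s, (ch) => ch.codePointAt(0));

// And back again from integers to a string.
const rebuilt = String.fromCodePoint(...asIntegers);
```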

sffc commented 3 years ago

Strawman from @reed-at-google about what would be necessary for Skia's needs:

https://github.com/google/skia/blob/main/site/docs/dev/design/uni_characterize.md

my2iu commented 3 years ago

That strawman API seems a little wonky. I’m not a WASM expert, but I don’t think strings are directly transferable from WASM to JS. You call from WASM into JS, and then the JS side reaches into the WASM memory space, copies the raw bytes out, and converts them into JS strings. Since WASM code is normally compiled from C++, text would usually be UTF-8, though UTF-16 is possible as well. As such, I’m not sure that minimizing the number of JS-to-WASM transitions needs to be reflected in the API design, and an API that operates on JS strings (as opposed to typed arrays) isn’t necessarily the fastest thing for WASM either. It does raise a good point about how batching might improve performance, depending on the overhead of JS-to-browser calls in each VM.
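To make that concrete, here is a minimal sketch of the JS side under some assumptions: the module exposes its linear memory, text is stored as UTF-8, and the pointer and length are passed as plain integers. `readUtf8` and `isDecimalBatch` are hypothetical names, and the batch function uses a regex only as a stand-in for a real property lookup.

```javascript
// Stand-in for a module's exported linear memory (one 64 KiB page).
const memory = new WebAssembly.Memory({ initial: 1 });

// Copy `len` UTF-8 bytes starting at `ptr` out of WASM memory into a JS string.
function readUtf8(mem, ptr, len) {
  const view = new Uint8Array(mem.buffer, ptr, len);
  return new TextDecoder("utf-8").decode(view);
}

// Hypothetical batched query: code points in, one flag byte per code point out.
// Typed arrays avoid creating a JS string per character in the caller.
function isDecimalBatch(codePoints) {
  const nd = /^\p{Nd}$/u;
  const out = new Uint8Array(codePoints.length);
  for (let i = 0; i < codePoints.length; i++) {
    out[i] = nd.test(String.fromCodePoint(codePoints[i])) ? 1 : 0;
  }
  return out;
}

// Simulate the WASM side having written a string at offset 16.
const encoded = new TextEncoder().encode("héllo");
const PTR = 16;
new Uint8Array(memory.buffer).set(encoded, PTR);
const roundTripped = readUtf8(memory, PTR, encoded.length);
```

Whether a batched, typed-array interface is actually faster would depend on the per-call overhead in each engine, which this sketch does not measure.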

sffc commented 3 years ago

Request for the "decimal" property: https://github.com/tc39/ecma402/issues/579