srl295 / es-unicode-properties

Unicode properties in ES
https://srl295.github.io/es-unicode-properties/
MIT License

Add more detailed use cases #3

Open sffc opened 5 years ago

sffc commented 5 years ago

In #2, you added to the README,

For applications, they can directly answer questions such as “What kind of script is 𞤘?”, “Is ġ lowercase?”, or “What is the numeric value of ५?”.

For feature implementers, this is a building block for implementing a wide array of higher level features, such as number parsing, segmentation, regular expressions, and much more.
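For a concrete sense of the gap: regex Unicode property escapes (ES2018) can already *test* a character against a known property value, but there is no built-in way to *query* a property (return the script name, or the numeric value) from a character. A minimal sketch of both sides, with the Devanagari-digit workaround hand-rolled from code points:

```javascript
// Testing against a known property value works today (ES2018):
console.log(/\p{Script=Adlam}/u.test("𞤘"));  // true — 𞤘 is U+1E918 ADLAM CAPITAL LETTER GA
console.log(/\p{Lowercase}/u.test("ġ"));      // true — ġ is U+0121

// But there is no API that returns a property value. For the numeric
// value of ५ you must hand-roll it from the code-point layout of the
// Devanagari digit block (U+0966 is DEVANAGARI DIGIT ZERO):
const devanagariDigitValue = (ch) => ch.codePointAt(0) - 0x0966;
console.log(devanagariDigitValue("५"));        // 5
```

The workaround only covers one digit block; a general numeric-value query is exactly what a properties API would provide.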

I think time should be spent working out more concrete examples for the use cases.

@litherum @FrankYFTang

srl295 commented 5 years ago

@litherum @FrankYFTang please see if 0ced4e2 addresses this. (I will make any further updates as a PR.)

srl295 commented 4 years ago

More use cases were requested.

Jamesernator commented 4 years ago

I for one have wanted the Name property of Unicode characters pretty much any time I've needed to parse a DSL and produce diagnostic output.

e.g.

Expected character in range: "0" (U+0030 DIGIT ZERO) to "9" (U+0039 DIGIT NINE) but got " " (U+0009 CHARACTER TABULATION).

noinkling commented 3 years ago

There are several properties that are helpful in determining what language or script a given string is in when it's unknown, especially when combined with CLDR's script metadata (it would be nice to get an API to expose this data too, like ICU does through the uscript API). This in turn opens up locale-specific processing (including existing APIs). If a string is mixed-script you can divide it up for different processing paths.

As something a little more concrete, say you want to generate readable, Unicode-supporting URL slugs. Intl.Segmenter with granularity: 'word' is a good basis for implementing this when it comes to languages like English which use separators (spaces) between words. But for languages/scripts where text is typically continuous (e.g. Chinese), word segmentation tends not to be particularly useful (in this context), aesthetic (e.g. a boundary every one or two characters for Chinese), or accurate (this is an inherently hard problem), so you might prefer sentence segmentation or no segmentation instead. Given a list of such scripts (such as those identified by "LB letters" in the CLDR metadata), how do we determine if each character in an unknown string falls into that category or not? The current options are:

(those final two aren't necessarily mutually exclusive)

I think you're probably looking for something a little less verbose and more obvious for the readme, but I thought I'd try and contribute anyway, given that this proposal doesn't seem to be gaining any traction.