textlint-rule / textlint-rule-sentence-length

textlint rule that limits the maximum length of a sentence.
MIT License

feat: count string by codepoint #44

Closed yumetodo closed 2 months ago

yumetodo commented 3 months ago

Abstract

The Unicode FAQ describes four ways to count string length: bytes, code units, code points, and grapheme clusters. https://unicode.org/faq/char_combmark.html#7

This PR adds support for counting by code points.
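
As a rough illustration of the difference between counting by code units and by code points, here is a minimal sketch. The helper names `countCodeUnits` and `countCodePoints` are hypothetical and are not taken from the rule's source; the actual implementation in this PR may differ.

```typescript
// Hypothetical helpers for illustration only; not the rule's actual code.

// String.prototype.length counts UTF-16 code units.
const countCodeUnits = (text: string): number => text.length;

// The string iterator walks code points, so Array.from yields one entry per code point.
const countCodePoints = (text: string): number => Array.from(text).length;

// "𠮷" (U+20BB7) lies outside the BMP and is encoded as a surrogate pair in UTF-16.
const sample = "𠮷野家";

console.log(countCodeUnits(sample)); // 4
console.log(countCodePoints(sample)); // 3
```

With a limit expressed in code units, the sentence above would be charged four "characters" even though a reader sees three.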

Motivation

When writing text in a language such as Japanese, surrogate pairs appear routinely. In that context, a length limit that does not account for surrogate pairs is painful to work with.

yumetodo commented 2 months ago

@azu Thank you for your review! I applied your suggestions.

FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" }) is more precise, but also more complex to implement due to language dependencies.

I only just noticed this API. If we pass undefined as the locale, the runtime's default locale is used, so lint results could differ between environments. We would therefore need to decide which locale to specify and how to configure it.

However, I think that is out of scope for this PR. countBy? could later be extended to something like countBy?: "codeunits" | "codepoints" | "grapheme";.
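
One way the extended option could look is sketched below. This is only an assumption about a possible future shape: the `countLength` helper, the `locale` parameter, and both default values are hypothetical, and only the `"codeunits" | "codepoints" | "grapheme"` union comes from the comment above. The locale is passed explicitly so the result does not depend on the host's default locale.

```typescript
// Sketch of a possible extension; not part of this PR or the released rule.
type CountBy = "codeunits" | "codepoints" | "grapheme";

// Hypothetical helper. The defaults here are assumptions made for the example.
function countLength(text: string, countBy: CountBy = "codeunits", locale: string = "ja-JP"): number {
    switch (countBy) {
        case "codeunits":
            // UTF-16 code units
            return text.length;
        case "codepoints":
            // The string iterator yields one entry per code point
            return Array.from(text).length;
        case "grapheme": {
            // An explicit locale avoids falling back to the runtime default
            const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
            return Array.from(segmenter.segment(text)).length;
        }
    }
}

// Two regional indicator symbols (J + P) that render as a single flag.
const flag = "\u{1F1EF}\u{1F1F5}";

console.log(countLength(flag, "codeunits")); // 4
console.log(countLength(flag, "codepoints")); // 2
console.log(countLength(flag, "grapheme")); // 1
```

Note that `Intl.Segmenter` requires a reasonably recent runtime, which is one more reason to treat grapheme counting as a separate change.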

azu commented 2 months ago

> However, I think that is out of scope for this PR. countBy? could later be extended to something like countBy?: "codeunits" | "codepoints" | "grapheme";.

Yes, I agree.

azu commented 2 months ago

https://github.com/textlint-rule/textlint-rule-sentence-length/releases/tag/v5.2.0 released