Updates min/maxLenth tests to mention graphemes rather than unicode code points

json-schema-org / JSON-Schema-Test-Suite

A language agnostic test suite for the JSON Schema specifications

MIT License

625 stars 209 forks source link

Updates min/maxLenth tests to mention graphemes rather than unicode code points #710

Closed spacether closed 11 months ago

spacether commented 11 months ago

Updates min/maxLenth tests to mention graphemes rather than unicode points Updates the language for the test that verifies that one grapheme is too short when checked against minLength of 2

spacether commented 11 months ago

This came up when working on my java implementation here: https://github.com/openapi-json-schema-tools/openapi-json-schema-generator and reading this article about grapheme length vs code points: http://illegalargumentexception.blogspot.com/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_strnlen

gregsdennis commented 11 months ago

Also ref: https://json-schema.slack.com/archives/CT8QRGTK5/p1701324578397949

Julian commented 11 months ago

Looks good to me too, just change it in the other drafts too

gregsdennis commented 11 months ago

change it in the other drafts too

Oh! Please also check for maxLength tests as well.

karenetheridge commented 11 months ago

Maybe we should pick some characters that actually make sense -- as far as I can tell "\uD83D\uDCA9" isn't valid utf8.

I did some digging in the history and the test was added here: #52 - and indeed this is UTF-16, not UTF-8. we should fix that. The proper sequence for the poop emoji, that should appear in the file, is \u1F4A9 -- but I think we should pick something better -- a sequence of unicode codepoints that combine together into one grapheme.

gregsdennis commented 11 months ago

No, JSON requires that it be broken out into the surrogate pair.

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

karenetheridge commented 11 months ago

aha, indeed - https://datatracker.ietf.org/doc/html/rfc4627#section-2.5

jdesrosiers commented 11 months ago

@spacether Just wanted to make sure you saw this request to add this change to other drafts as well.