hsolbrig / PyShEx

ShEx interpreter for ShEx 2.0
Creative Commons Zero v1.0 Universal

Unicode outside BMP #87

Open ericprud opened 1 year ago

ericprud commented 1 year ago

iirc, PyShEx failed tests where the schema (or data?) had codepoints > U+FFFD. I stumbled across a repo that I created for dealing with this in Java and JavaScript, both of which use UTF-16 internally and thus require the grammar to be written in terms of surrogate pairs rather than codepoints from U+10000 up. I don't remember the state of this repo, but it could be handy to clone it and play with the Python there rather than experimenting in the larger ShEx g4.
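
For reference, a minimal sketch of the difference described above, assuming CPython 3.3 or later, where `str` is indexed by codepoint. The helper name `to_surrogate_pair`, the pattern name `ASTRAL`, and the range U+10000 to U+EFFFF are illustrative choices for this example and are not taken from PyShEx or the ShEx g4. It shows that a UTF-16 target needs a two-unit surrogate pair for a non-BMP character, while a Python character class can name the codepoint range directly.

```python
import re
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Return the UTF-16 (high, low) surrogates for a non-BMP codepoint."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint is not outside the BMP")
    offset = cp - 0x10000
    # Top 10 bits go in the high surrogate, bottom 10 bits in the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# In Python the character class can use the codepoints themselves.
# The range below is an assumption made for illustration.
ASTRAL = re.compile("[\U00010000-\U000EFFFF]")

ch = "\U0001F600"                                   # one non-BMP character
high, low = to_surrogate_pair(ord(ch))
utf16_units = len(ch.encode("utf-16-le")) // 2      # code units in UTF-16
matched = ASTRAL.fullmatch(ch) is not None
```

Here `len(ch)` is 1 in Python, while `utf16_units` is 2, which is why a Java or JavaScript grammar has to spell the same range as a high surrogate followed by a low surrogate.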