Open retorquere opened 3 months ago
hey, good catch! Yeah, I agree that compromise should not tokenize these inline unicode forms. happy to add a guard for this, in the next release. cheers
hey, just double-checking something: the \u0301 in your example Poincare\u0301 seems to be a punctuation symbol '́', which arguably should be considered non-word whitespace, maybe.
Can you generate an example where the NFD character is more word-like? I agree it rubs-up against the javascript normalize feature, and maybe our supporting it would just complicate things. lemme know, cheers
It's just the Combining Acute Accent:
const show = obj => JSON.stringify(obj, null, 2).replace(/[\u007F-\uFFFF]/g, chr => `\\u${chr.charCodeAt(0).toString(16).padStart(4, '0')}`)
console.log(show(`e\u0301`.normalize('NFC')))
shows
"\u00e9"
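For the punctuation question, a quick check with Unicode property escapes might help. This is only a sketch, assuming an ES2018+ runtime; `isMark` and `isPunctuation` are made-up helper names for illustration, not anything from compromise:

```javascript
// Illustrative helpers (hypothetical names, not part of any library).
// \p{M} matches the Unicode "Mark" categories, \p{P} matches "Punctuation".
const isMark = chr => /^\p{M}$/u.test(chr)
const isPunctuation = chr => /^\p{P}$/u.test(chr)

const accent = '\u0301' // COMBINING ACUTE ACCENT

console.log(isMark(accent))        // true
console.log(isPunctuation(accent)) // false
```

So by general category it is a nonspacing mark that attaches to the preceding letter, not punctuation.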
it's easy enough to normalize the input before passing it into tokenization, but that would then be a design constraint, and as mentioned, there are combining characters that have no single-char NFC form.
logs
normalizing to NFC does work, but not every combining char combination has an NFC form (e.g. 'Poincare\u0301 E\u0300\u0304'.normalize('NFC'))
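To make the leftover-mark case concrete, here is a rough sketch. The `words` splitter is a hypothetical stand-in that treats marks as word characters; it is not how compromise tokenizes, just one possible shape for a guard:

```javascript
const input = 'Poincare\u0301 E\u0300\u0304'
const nfc = input.normalize('NFC')

// e + U+0301 composes to U+00E9, and E + U+0300 composes to U+00C8,
// but U+00C8 + U+0304 has no precomposed form, so the macron stays separate.
const leftoverMarks = nfc.match(/\p{M}/gu) || []
console.log(leftoverMarks.length) // 1

// Hypothetical splitter: letters, combining marks and digits all count
// as word characters, so the input does not need to be normalized first.
const words = text => text.match(/[\p{L}\p{M}\p{N}]+/gu) || []
console.log(words(input).length) // 2
```

Keeping `\p{M}` in the word class means the result is the same two words whether or not the caller normalized first.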