NFD form combining characters not picked up as part of word

spencermountain / compromise

modest natural-language processing

http://compromise.cool

MIT License

11.31k stars 645 forks

NFD form combining characters not picked up as part of word #1099

Open retorquere opened 3 months ago

retorquere commented 3 months ago

function show(s) {
  return s.replace(/[^\x00-\x7F]/g, c => "\\u" + ("0000" + c.charCodeAt(0).toString(16)).slice(-4))
}
var nlp = require("compromise/one")
var doc = nlp('Poincare\u0301')
for (const term of doc.json({offset:true})[0].terms) {
  console.log(show(JSON.stringify(term, null, 2)))
}

logs

{
  "text": "Poincare",
  "pre": "",
  "post": "\u0301",
  "tags": [],
  "normal": "poincare",
  "index": [
    0,
    0
  ],
  "id": "poincare|002000009",
  "offset": {
    "index": 0,
    "start": 0,
    "length": 8
  }
}

normalizing to NFC does work for this input, but not every combining character sequence has a precomposed NFC form (e.g. 'Poincare\u0301 E\u0300\u0304'.normalize('NFC') still leaves \u0304 as a separate combining mark)
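To make the NFC point concrete, here is a minimal sketch using only built-in string methods, no compromise involved. The helper name `escapeNonAscii` is made up for this illustration; it does the same job as `show` above.

```javascript
// Hypothetical helper: escape every non-ASCII UTF-16 code unit as \uXXXX
// so that combining marks are visible in the console.
const escapeNonAscii = s =>
  s.replace(/[^\x00-\x7F]/g, c => '\\u' + c.charCodeAt(0).toString(16).padStart(4, '0'))

// e + COMBINING ACUTE ACCENT has a precomposed form (U+00E9)
const composed = 'Poincare\u0301'.normalize('NFC')

// E + COMBINING GRAVE + COMBINING MACRON only composes partially:
// E + grave becomes U+00C8, the macron stays a separate code point
const partial = 'E\u0300\u0304'.normalize('NFC')

console.log(escapeNonAscii(composed)) // Poincar\u00e9
console.log(escapeNonAscii(partial))  // \u00c8\u0304
```

So after NFC the second string still ends in a standalone combining mark, which a tokenizer that only recognises precomposed letters would presumably split off again.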

spencermountain commented 3 months ago

hey, good catch! Yeah, I agree that compromise should not tokenize these inline unicode forms. happy to add a guard for this, in the next release. cheers

spencermountain commented 3 months ago

hey, just double-checking something, your example Poincare\u0301 seems to be a punctuation symbol '́' - which arguably should be considered non-word whitespace maybe.

Can you generate an example where the NFD character is more word-like? I agree it rubs up against the javascript normalize feature, and maybe our supporting it would just complicate things. lemme know, cheers
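One way to check how Unicode classifies the character in question is with regex property escapes, which need the `u` flag. This is only a quick sketch with built-in JavaScript, and it says nothing about how compromise itself classifies characters.

```javascript
const acute = '\u0301' // U+0301 COMBINING ACUTE ACCENT

// \p{M} matches the Mark categories (Mn, Mc, Me); \p{P} matches punctuation
console.log(/^\p{M}$/u.test(acute)) // true: it is a combining mark
console.log(/^\p{P}$/u.test(acute)) // false: not punctuation
console.log(/^\s$/u.test(acute))    // false: not whitespace
```

By that classification the accent belongs to the preceding letter rather than to the punctuation or whitespace around the word.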

retorquere commented 3 months ago

It's just the Combining Acute Accent:

const show = obj => JSON.stringify(obj, null, 2).replace(/[\u007F-\uFFFF]/g, chr => `\\u${chr.charCodeAt(0).toString(16).padStart(4, '0')}`)
console.log(show(`e\u0301`.normalize('NFC')))

shows

"\u00e9"

it's easy enough to normalize the input before passing it into tokenization, but that would then be a design constraint, and as mentioned, there are combining characters that have no single-char NFC form.
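As a stopgap that does not depend on NFC, one option might be to post-process the term objects and move any combining marks at the start of `post` back onto `text`. The sketch below is only an illustration: `reattachMarks` is a hypothetical helper, not part of compromise, and it works on a plain object shaped like the JSON logged earlier in this thread. It leaves `normal` and `offset` untouched, so those would still need adjusting.

```javascript
// Hypothetical helper (not a compromise API): if `post` begins with one or
// more combining marks, append them to `text` and strip them from `post`.
const LEADING_MARKS = /^\p{M}+/u

function reattachMarks(term) {
  const match = LEADING_MARKS.exec(term.post)
  if (!match) return term // nothing to move
  const marks = match[0]
  return {
    ...term,
    text: term.text + marks,
    post: term.post.slice(marks.length),
  }
}

// Term shaped like the output logged at the top of this issue
const term = { text: 'Poincare', pre: '', post: '\u0301', normal: 'poincare' }
const fixed = reattachMarks(term)

console.log(fixed.text === 'Poincare\u0301') // true
console.log(fixed.post === '')               // true
```

This keeps the input in whatever normalization form it arrived in, which is the point when a sequence has no single-char NFC form.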