spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Discussion: toTitleCase() & coercing existing uppercase #816

Open wesww opened 3 years ago

wesww commented 3 years ago

Issue

toTitleCase only selects and operates on the first character of each word, which in many cases is insufficient.

Examples:

const desiredResult = 'The MRI Machine'

t.equal(nlp('the MRI machine').toTitleCase().text(), desiredResult) // PASS
t.equal(nlp('the mri machine').toTitleCase().text(), desiredResult) // FAIL
t.equal(nlp('THE MRI MACHINE').toTitleCase().text(), desiredResult) // FAIL
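To make the failures concrete, here is a minimal sketch of a first-character-only title-caser. It is a plain-JavaScript model of the behavior described above, not compromise's actual implementation, and the name `firstCharTitleCase` is hypothetical. The outputs in the comments are what this model produces; the library's real output for the failing cases is not shown in this thread.

```javascript
// Hypothetical helper: NOT compromise's implementation, only a model of
// "uppercase the first character of each word and leave the rest alone".
const firstCharTitleCase = (str) =>
  str
    .split(' ')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ')

console.log(firstCharTitleCase('the MRI machine')) // The MRI Machine
console.log(firstCharTitleCase('the mri machine')) // The Mri Machine
console.log(firstCharTitleCase('THE MRI MACHINE')) // THE MRI MACHINE
```

Under this model, only input that already has the acronym in uppercase and everything else in lowercase reaches the desired result.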

Workaround

It's not pretty, but this basically works:

const toTitleCase = (string) => {
  const doc = nlp(string)
  doc.match('#Acronym').toUpperCase()
  doc.match('!#Acronym').toLowerCase().toTitleCase()
  return doc.text()
}

spencermountain commented 3 years ago

that's a very clever solution.

Yeah, I've flip-flopped on this one. I'm not sure what's best. It would be cool if it were, like your example, smart about title-casing based on a term's POS tags. I'd be happy to change it if others agree.

To me, the least destructive path is to leave any existing uppercase in the text unchanged. I had considered a separate case-normalize plugin, due to a certain former tweeter. It would still be fun to do. Cheers
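
For anyone weighing the two options, a separate case-normalizing step might look roughly like the sketch below. It is only an illustration in plain JavaScript: `normalizeTitleCase` and the `ACRONYMS` list are hypothetical names, and the hand-written list stands in for the `#Acronym` tag used in the workaround above. It is not an existing compromise plugin or API.

```javascript
// Hypothetical stand-in for the '#Acronym' tag: a small hand-written list.
const ACRONYMS = new Set(['MRI', 'NASA', 'FBI'])

// Destructive variant: rewrites the case of every word, not just the
// first character, so existing uppercase in the input is NOT preserved.
const normalizeTitleCase = (str, acronyms = ACRONYMS) =>
  str
    .split(' ')
    .map((word) => {
      const upper = word.toUpperCase()
      // Known acronyms are forced to full uppercase.
      if (acronyms.has(upper)) return upper
      // Everything else is lowercased, then given a leading capital.
      return word.charAt(0).toUpperCase() + word.slice(1).toLowerCase()
    })
    .join(' ')

console.log(normalizeTitleCase('the mri machine')) // The MRI Machine
console.log(normalizeTitleCase('THE MRI MACHINE')) // The MRI Machine
```

The trade-off is the one raised above: this version gives a consistent result for all three inputs in the issue, but it also overwrites intentional capitals such as 'iPhone' or 'McDonald', which is why keeping it separate from the default toTitleCase() behavior seems reasonable.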