spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Respect tag rules for custom tags #1155

Open tony-scio opened 1 month ago

tony-scio commented 1 month ago

I'd like to be able to define custom tags that are applied only if their allowed parts of speech are respected. It looks like the plugin interface tries to support that, but it doesn't seem to be working. Wondering if this is a bug, feature request, or if there's another way to accomplish this. Here's what I tried:

nlp.plugin({
  tags: {
    Employee: {
      also: ['ProperNoun'],
      not: ['Verb', 'Adverb', 'Adjective'],
    },
  },
  words: {
    will: 'Employee',
  },
})

nlp('Will is an employee').match('#Employee') // Matches like I expect.
nlp('I will go to the store').match('#Employee') // Matches, but I expected not to match since "will" is used as a verb and "Employee" is defined not to be a verb.

spencermountain commented 4 weeks ago

hey Tony, you're right - there are a number of things going wrong with this example. Apologies for the confusion.

Let me look at fixing the default 'will is' tagging. Your plugin looks correct. You may be interested in the freeze() feature, to coerce all 'will' appearances to your 'will' (if that's what you're after?). There's a lot of gross overlap when the user-defined lexicon gets beaten up by downstream tagging changes. The freeze feature is supposed to remedy this.

Will put this on the pile, for the next release. thanks cheers

tony-scio commented 4 weeks ago

Thanks! If you point me in the right direction, I could also take a stab at a PR.

Regarding freeze, I see how it can enforce my custom lexicon, but don't see how I can tell it to enforce the default lexicon and apply a custom one only if it fits (at least in a way that'd work against multiple docs). If you wouldn't mind, could you write a couple of lines that'd use freeze to make the above example work on two different docs?

spencermountain commented 3 weeks ago

yeah, i'm torn about this too, and the lex vs freeze thing has a lot of gross mystery to it. I would use the default lex, and clean up any tagging issues with match().tag() statements.

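// match()'s second argument selects the [bracketed] capture group; tag() then applies the tag
doc.match('#ProperNoun [will] #Infinitive', 0).tag('Verb')
doc.match('[will] #Copula', 0).tag('Employee')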

that way you're always in control over what you get, and there's no fancy-biz. (or at least, less!) cheers