spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

extending tagset #406

Closed tamagokun closed 6 years ago

tamagokun commented 7 years ago

Wondering if there is a way to extend the tagset logic for custom tags? I've been poking around the source and don't see anything to indicate that it can be done.

Example:

const lexicon = {
  'jones': 'Doctor'
}

const tags = {
  Doctor: {
    is: 'Person'
  }
}

spencermountain commented 7 years ago

hey mike, yeah it's a good question, and the short answer is that there are a few fleabag ways to do it, but not a good solution, like the one you've suggested.

you can make any tag you want, but to get the Doctor->Person stuff, for now you can do this:

const lexicon={
  jones:['Doctor', 'Person']
}

or alternatively,

var doc=nlp(myText, {jones:'Doctor'})
doc.match('#Doctor').tagAs('Person')

... but we should really support a clever way to extend the native tag stuff. I'm happy to work on that.
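
Until native support exists, the array form above can be generated from a tags map like the one in the original question. This is only a minimal sketch: `expandLexicon` is a hypothetical helper, not part of compromise, and it assumes each tag declares at most one parent through an `is` key.

```javascript
// Hypothetical helper (not a compromise API): expands each lexicon entry
// with the parent tags declared through `is` links in a tags map.
function expandLexicon(lexicon, tags) {
  const expanded = {};
  for (const [word, tag] of Object.entries(lexicon)) {
    const chain = [];
    const seen = new Set();
    let current = tag;
    // Walk up the `is` links, stopping on a missing parent or a cycle.
    while (current && !seen.has(current)) {
      seen.add(current);
      chain.push(current);
      current = tags[current] ? tags[current].is : undefined;
    }
    expanded[word] = chain;
  }
  return expanded;
}

const lexicon = { jones: 'Doctor' };
const tags = { Doctor: { is: 'Person' } };

console.log(expandLexicon(lexicon, tags));
// { jones: [ 'Doctor', 'Person' ] }
```

The result has the same shape as the array-valued lexicon shown above, so the hierarchy only has to be written once.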

tamagokun commented 7 years ago

seems like providing some kind of API for working with the lexicon/tagset is in order. Right now compromise seems super slow because every time I run it, it "loads" the exact same lexicon that I am supplying.

I'd love to be able to set up my lexicon, tagsets, once, and then set those for compromise.

tamagokun commented 7 years ago

Thanks for providing insight into how to work around the tagset thing right now, it seems to work well!

spencermountain commented 7 years ago

yeah! thanks. your timing is very good for this feature. we can look at including it in v11, which will hopefully be ready sometime this week.

how would this be?

var nlp=require('compromise') //does background init work

nlp.addWords(myLexicon)  //your lexicon (persistent)

nlp.addTags({Person: ['Doctor', 'Nurse', 'Plumber']}) //plug these into the tagging logic

//now this is fast-path
nlp(text1)
nlp(text2)

i got stuck on this just because i was trying to make an nlp.clone() method that would somehow let you have two different functions. I still haven't figured out how to do that.
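
For what it's worth, one possible shape for the clone idea is a factory that closes over its own lexicon, so each returned function carries independent state. This is a toy sketch and says nothing about how compromise does or should implement it: `createTagger`, its whitespace tokenizer, and the chained `addWords`/`clone` methods are all hypothetical.

```javascript
// Hypothetical factory (not compromise's API): every call returns an
// independent tagger function that owns a private lexicon.
function createTagger(baseLexicon = {}) {
  const lexicon = {};

  // Toy tagger: lowercases, splits on whitespace, looks each word up.
  function tagger(text) {
    return text
      .toLowerCase()
      .split(/\s+/)
      .filter(Boolean)
      .map((word) => ({ word, tags: lexicon[word] || [] }));
  }

  // Adds or overrides words; tags are stored as fresh arrays so that
  // clones never share mutable state.
  tagger.addWords = (words) => {
    for (const [word, tag] of Object.entries(words)) {
      lexicon[word] = [].concat(tag);
    }
    return tagger;
  };

  // A clone starts from a copy of the current lexicon.
  tagger.clone = () => createTagger(lexicon);

  return tagger.addWords(baseLexicon);
}

const medical = createTagger({ jones: 'Doctor' });
const plumbing = medical.clone().addWords({ jones: 'Plumber' });

console.log(medical('Jones')[0].tags);  // [ 'Doctor' ]
console.log(plumbing('Jones')[0].tags); // [ 'Plumber' ]
```

The point of the closure is that adding words to one function never leaks into the other, which is the "two different functions" behaviour described above.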

tamagokun commented 7 years ago

that's exactly what I need :+1:

owendall commented 7 years ago

Yes, this really helps. :+1:

owendall commented 7 years ago

Not sure if this is a digression, but how do we best handle polysemy (?) when we create new lexicons?

First determine the tag (pos) before checking the custom lexicon?

"I doctored the photograph"

spencermountain commented 7 years ago

sorry for the delay,

yeah, for more context-sensitive tagging, i recommend doing it afterwards with .match().tagAs()

var doc=nlp("Have You Met Life Today?")
doc.match('#QuestionWord #Noun met')....
doc.match('met life #Verb')...

or whatever..

i think of it as the lexicon is for not-smart tagging, and the smarter stuff's gotta come afterwards.
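
To make that ordering concrete, here is a toy two-pass tagger in plain JavaScript. It is only an illustration under loud assumptions: `tagWithLexicon` and `retagSequence` are hypothetical helpers, the word-list pattern is far simpler than compromise's match syntax, and the `Organization` label is just an example tag.

```javascript
// Pass 1 (hypothetical helper): context-free lexicon lookup.
function tagWithLexicon(text, lexicon) {
  return text
    .replace(/[?!.,]/g, '')
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => ({
      word,
      tags: [].concat(lexicon[word.toLowerCase()] || []),
    }));
}

// Pass 2 (hypothetical helper): retag any run of adjacent words that
// matches `pattern`, a list of lowercase words.
function retagSequence(terms, pattern, tag) {
  for (let i = 0; i + pattern.length <= terms.length; i++) {
    const hit = pattern.every(
      (p, j) => terms[i + j].word.toLowerCase() === p
    );
    if (hit) {
      for (let j = 0; j < pattern.length; j++) {
        terms[i + j].tags = [tag];
      }
    }
  }
  return terms;
}

const toyLexicon = { met: 'Verb', life: 'Noun' };
const terms = tagWithLexicon('Have You Met Life Today?', toyLexicon);

// The lexicon pass alone tags "Met" as a Verb; the context pass
// overrides it because of the word that follows.
retagSequence(terms, ['met', 'life'], 'Organization');
console.log(terms.filter((t) => t.tags.length > 0));
```

The lexicon stays a flat word-to-tag table, and anything that depends on neighbouring words runs as a separate step afterwards.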

tamagokun commented 6 years ago

Just tried out v11, totally amazing. Going to close this issue. Thanks for a great library!