spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Get results of lexicon lookup #734

Closed lhk closed 4 years ago

lhk commented 4 years ago

Hi,

I'm looking for a library that can do something like stemming or lemmatization for me. Doesn't really have to be proper lemmatization. Ideally, I'm looking for some base reference form, like the singular for every noun and adjective, the infinitive for verbs, ...

So far I've come up with this:

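import nlp from 'compromise';

const text = 'the dogs are barking hungrily.';
const doc = nlp(text);
// Reduce verbs to the infinitive and nouns to the singular; .all() zooms back out to the whole document after each step.
const transformed = doc.verbs().toInfinitive().all().nouns().toSingular().all();

// Print the resulting text rather than the document object.
console.log(transformed.text());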

This already does a reasonable job for nouns and verbs. But I'm missing all the conjugated versions of adjectives, adverbs, pronouns, etc.

You say:

"Because compromise can conjugate all sorts of forms, it only needs to store one grammatical form."

Apparently, compromise does a lookup in this lexicon and then uses the POS and other information associated with the base form.

Is it possible to retrieve that base form? It sounds like exactly what I need.

spencermountain commented 4 years ago

hey Lars, yes exactly. That's great you've got that bit working. Ya, this is a good question. I've wanted to do this too - create a very reduced format to work with.

The lexicon work is done here - you can see that it also creates superlative and comparative forms for adjectives, which you can play around with here. I'm not sure that helps in your project. We don't have a superlative->adjective converter, but I'm sure it's pretty easy to do. Maybe we can collaborate on this.

The other parts of speech get a little mushier. Converting numbers may help. Maybe the proper way to normalize an adverb is to just remove it. For pronouns, were you thinking her->she? cheers

lhk commented 4 years ago

Hey, thank you very much for the quick reply :)

The codebase is really nice to read. As far as I can see, the changes between singular/plural etc. are done with regexes. There are no lexicon lookups there.

Hm, I think that to look up the base form of every word, a new lookup table would have to be added and filled whenever these regexes are applied: for each automatically generated lexicon entry, store the base word in this table. But that would be quite some overhead.
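
A minimal sketch of that lookup-table idea, in plain JavaScript. Everything here is hypothetical: `naivePlural`, `buildLexicon` and the `roots` map are illustration names, not compromise internals, and the pluralizer is a toy stand-in for the library's regex rules.

```javascript
// Hypothetical sketch: none of these names come from compromise itself.
// A toy regex pluralizer stands in for the library's own rules.
function naivePlural(noun) {
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + 'es';
  if (/[^aeiou]y$/.test(noun)) return noun.replace(/y$/, 'ies');
  return noun + 's';
}

// Expand a base-form lexicon and record which base word each entry came from.
function buildLexicon(baseWords) {
  const lexicon = {};      // word -> tag
  const roots = new Map(); // any stored form -> base form
  for (const [word, tag] of Object.entries(baseWords)) {
    lexicon[word] = tag;
    roots.set(word, word);
    if (tag === 'Noun') {
      const plural = naivePlural(word);
      lexicon[plural] = 'Plural';
      roots.set(plural, word); // keep the link that would otherwise be lost
    }
  }
  return { lexicon, roots };
}

const { lexicon, roots } = buildLexicon({
  dog: 'Noun',
  city: 'Noun',
  box: 'Noun',
  quick: 'Adjective',
});

console.log(roots.get('dogs'));   // dog
console.log(roots.get('cities')); // city
```

The overhead is one extra map entry per lexicon entry, which is the cost mentioned above.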

What do you think?

spencermountain commented 4 years ago

hey Lars, neat idea. yeah the idea of caching the root forms is a cool one. You're right - we derive plural forms for the lexicon, but then lose the link between them. Maybe it could be a plugin of some kind.

I've long wanted to do something like this

nlp('i walked ecstatically').swap({ecstatic:'happy'}).text()
//i walked happily

which is pretty-much the same task as you're trying to do, I think.

Ya, happy to work on this with you. I would prefer to keep it a plugin for now, if possible.

Maybe we need an adverb plugin first, which stems them (probably just remove /ly$/). Then, I don't know, other normalizations you can think of. Then wrapping those stemming functions into one plugin should be straightforward? I dunno. Any ideas welcomed. cheers
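
A rough sketch of what that adverb stemming could look like, as a standalone function rather than a real compromise plugin. `stemAdverb` is a hypothetical name, and the "-ily" rule is an extra assumption on top of the plain /ly$/ removal.

```javascript
// Hypothetical helper, not part of compromise: reduce an adverb to its adjective.
function stemAdverb(word) {
  // Assumption beyond the plain rule: "-ily" adverbs get their "y" back.
  if (/ily$/.test(word)) return word.replace(/ily$/, 'y');
  // The plain rule: remove a trailing "ly". Words without it pass through unchanged.
  return word.replace(/ly$/, '');
}

console.log(stemAdverb('quickly'));  // quick
console.log(stemAdverb('hungrily')); // hungry
console.log(stemAdverb('fast'));     // fast
```

This only makes sense on words already tagged as adverbs, and it is still incomplete: 'ecstatically' comes out as 'ecstatical', so "-ally" forms would need a rule of their own.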