lhk closed this issue 4 years ago
hey Lars, yes exactly. That's great you've got that bit working. Ya, this is a good question. I've wanted to do this too - create a very reduced format to work with.
The lexicon work is done here - you can see that it also creates superlative and comparative forms for adjectives, which you can play around with here. I'm not sure that helps in your project. We don't have a superlative->adjective converter, but I'm sure it's pretty easy to do.
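For illustration, a superlative->adjective converter could be sketched roughly as below. This is a minimal rule-based sketch, not something that exists in the library: the function name `fromSuperlative` and the tiny irregulars table are made up for this example, and it assumes the word has already been tagged as a superlative (otherwise words like "honest" would be mangled).

```javascript
// Hypothetical helper, not part of compromise: naive superlative -> adjective.
// Assumes the input is already known to be a superlative.
const irregularSuperlatives = { best: 'good', worst: 'bad' };

function fromSuperlative(word) {
  if (irregularSuperlatives.hasOwnProperty(word)) {
    return irregularSuperlatives[word];
  }
  if (!/est$/.test(word)) {
    return word;
  }
  const stem = word.slice(0, -3);
  // happiest -> happi -> happy
  if (/i$/.test(stem)) {
    return stem.slice(0, -1) + 'y';
  }
  // biggest -> bigg -> big (only consonants that commonly double)
  if (/([bdgmnprt])\1$/.test(stem)) {
    return stem.slice(0, -1);
  }
  // quickest -> quick. Words with a dropped "e" (nicest) come out wrong here.
  return stem;
}

console.log(fromSuperlative('happiest'));
console.log(fromSuperlative('biggest'));
```

The known weak spot is adjectives ending in a silent "e" ("nicest", "largest"), which rules alone cannot distinguish from "quickest"; a lexicon lookup would be needed for those.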
Maybe we can collaborate on this.
The other parts of speech get a little mushier. Converting numbers may help. Maybe the proper way to normalize an adverb is to just remove it. For pronouns, were you thinking her->she?
cheers
Hey, thank you very much for the quick reply :)
The codebase is really nice to read. As far as I can see, the changes between singular/plural etc. are done with regexes. There are no lexicon lookups there.
Hm, I think to have the new functionality of looking up the base form of every word, a new lookup table should be added and filled when these regexes are applied. For each automatically generated lexicon entry, store the base word in this table. But that would be quite some overhead.
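To make the idea concrete, here is a rough sketch of filling such a table while the derived forms are generated. Everything in it is an assumption for illustration: `pluralize` is a naive stand-in for the library's real regex rules, and `buildLexicon` and `baseForms` are hypothetical names, not compromise internals.

```javascript
// Hypothetical stand-in for the library's regex-based pluralization rules.
function pluralize(noun) {
  if (/(s|x|z|ch|sh)$/.test(noun)) return noun + 'es';
  if (/[^aeiou]y$/.test(noun)) return noun.slice(0, -1) + 'ies';
  return noun + 's';
}

// Build the lexicon and, alongside it, a map from every generated form
// back to the base word it was derived from.
function buildLexicon(nouns) {
  const lexicon = {};
  const baseForms = new Map();
  for (const noun of nouns) {
    lexicon[noun] = 'Singular';
    baseForms.set(noun, noun);
    const plural = pluralize(noun);
    lexicon[plural] = 'Plural';
    baseForms.set(plural, noun); // keep the link that is otherwise lost
  }
  return { lexicon, baseForms };
}

const { lexicon, baseForms } = buildLexicon(['cat', 'box', 'city']);
console.log(baseForms.get('cities'));
```

The overhead is one extra map entry per generated form, so the table grows in step with the lexicon itself.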
What do you think?
hey Lars, neat idea. yeah the idea of caching the root forms is a cool one. You're right - we derive plural forms for the lexicon, but then lose the link between them. Maybe it could be a plugin of some kind.
I've long wanted to do something like this
nlp('i walked ecstatically').swap({ecstatic:'happy'}).text()
//i walked happily
which is pretty-much the same task as you're trying to do, I think.
Ya, happy to work on this with you. I would prefer to keep it a plugin for now, if possible.
Maybe we first need an adverb plugin, which stems them (probably just remove /ly$/).
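As a rough sketch of what that stemming step might look like: the function below strips the "ly" suffix and patches two common spelling changes. `stemAdverb` is a hypothetical name, not an existing compromise function, and it assumes the word has already been tagged as an adverb (so that nouns like "family" never reach it).

```javascript
// Hypothetical helper, not part of compromise: naive adverb stemmer.
// Irregular cases (e.g. "gently", "fully", "early") are NOT handled.
function stemAdverb(word) {
  if (!/ly$/.test(word)) {
    return word;
  }
  // ecstatically -> ecstatic
  if (/ically$/.test(word)) {
    return word.replace(/ally$/, '');
  }
  // happily -> happy
  if (/ily$/.test(word)) {
    return word.replace(/ily$/, 'y');
  }
  // quickly -> quick
  return word.replace(/ly$/, '');
}

console.log(stemAdverb('ecstatically'));
console.log(stemAdverb('happily'));
```

Running the same rules in reverse (adjective to adverb) is what the swap example above would need in order to turn "happy" back into "happily".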
Then, I don't know, other normalizations you can think of.
Then wrapping those stemming functions into one plugin should be straightforward? I dunno. Any ideas welcomed.
cheers
Hi,
I'm looking for a library that can do something like stemming or lemmatization for me. Doesn't really have to be proper lemmatization. Ideally, I'm looking for some base reference form, like the singular for every noun and adjective, the infinitive for verbs, ...
So far I've come up with this:
This already does a reasonable job for nouns and verbs. But I'm missing all the conjugated versions of adjectives, adverbs, pronouns, etc.
You say:
Apparently, compromise does a lookup in this lexicon and then uses the POS and other information associated with the base form.
Is it possible to retrieve that base form? It sounds like exactly what I need.