spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License
11.46k stars 654 forks

Add support for synonyms? #505

Open flesler opened 6 years ago

flesler commented 6 years ago

It'd be great if normalizing text could include synonyms (and even antonyms!), so that all synonyms are normalized to the same word (and "not ${antonym}" too).

It shouldn't be incredibly complex; it's equivalent to using replace(synonym, normalized) for each case (but much more optimized, I hope).

flesler commented 6 years ago

This could make stuff like #Currency more useful: being able to normalize to either the name or the symbol would be great (unless it can already be done and I missed it).

spencermountain commented 6 years ago

love this idea

flesler commented 6 years ago

I temporarily implemented this myself, adding the following to the "plugin":

synonyms: {
    u: 'you',
    ya: 'you',
    bc: 'because',
    r: 'are',
    sth: 'something',
    pls: 'please',
    sry: 'sorry',
    '&': 'and',
    okay: 'ok',
    congrats: 'congratulations',
    congratz: 'congratulations',
}

Then I iterate and replace(key, synonyms[key])
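
For anyone reading along, the iterate-and-replace step could look roughly like the sketch below. It is plain JavaScript rather than compromise's own replace, the table is a shortened copy of the one above, and the normalizeSynonyms name is made up for illustration. It only swaps whole whitespace-separated tokens, so a key like u does not touch the "u" inside "but"; tokens with punctuation attached (e.g. "u,") are left alone in this simple version.

```javascript
// Shortened copy of the synonyms table from the comment above.
const synonyms = {
  u: 'you',
  bc: 'because',
  r: 'are',
  pls: 'please',
  '&': 'and',
  congrats: 'congratulations',
}

// Hypothetical helper: swap whole tokens that appear as keys in the table.
function normalizeSynonyms(text, table) {
  return text
    .split(/(\s+)/) // the capture group keeps the whitespace runs in the result
    .map(token => {
      const key = token.toLowerCase()
      // own-property check so words like "constructor" are not picked up
      return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : token
    })
    .join('')
}

console.log(normalizeSynonyms('pls help bc u & I r stuck', synonyms))
// please help because you and I are stuck
```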

buildbreakdo commented 6 years ago

If this helps anyone get started, I pulled synonyms out of WordNet a while back. They are in WordNet format, which I don't particularly like. The Elasticsearch docs have some good discussion of synonyms here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms

I think Solr-style synonyms are much more legible:

Solr:

"synonyms" : [
  "i-pod, i pod => ipod",
  "universe, cosmos"
]

Wordnet:

"synonyms" : [
  "s(100000001,1,'abstain',v,1,0).",
  "s(100000001,2,'refrain',v,1,0).",
  "s(100000001,3,'desist',v,1,0)."
]
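
To show why the Solr-style rules are easy to work with, here is a rough sketch that turns them into a flat variant-to-canonical lookup. The parseSolrSynonyms name is hypothetical, and the sketch makes one simplifying assumption: for a rule without "=>" (like "universe, cosmos"), the first word listed is treated as the canonical form.

```javascript
// Hypothetical helper: parse Solr-style synonym rules into { variant: canonical }.
// Assumption: a rule without "=>" maps every listed word to the first one.
function parseSolrSynonyms(rules) {
  const lookup = {}
  for (const rule of rules) {
    const [left, right] = rule.split('=>').map(side => side.trim())
    const variants = left.split(',').map(w => w.trim()).filter(Boolean)
    const canonical = right !== undefined ? right : variants[0]
    for (const variant of variants) {
      lookup[variant] = canonical
    }
  }
  return lookup
}

const lookup = parseSolrSynonyms(['i-pod, i pod => ipod', 'universe, cosmos'])
console.log(lookup['i pod']) // ipod
console.log(lookup.cosmos) // universe
```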

Could skip all that and do something clever with getters and setters like:

const synonyms = {
  "ipod": ["ipod", "i pod", "i-pod"],
  // each variant is a getter that resolves to the root word's list
  get "i pod"() { return this["ipod"] },
  get "i-pod"() { return this["ipod"] },
  // ...
}

Every word variant maps to the root word which holds all variants including itself. Used on text like cool i-pod:

"cool i-pod".split(' ').map(word => synonyms[word] || word) would output:

["cool", ["ipod", "i pod", "i-pod"]]
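
One catch with splitting on spaces is that a two-word variant like "i pod" never reaches its getter. A rough way around that, sketched below, is to try the two-word phrase before the single word. The lookupWords name is made up, and this only handles phrases of up to two words.

```javascript
// Same shape as the object above: every variant resolves to the root's list.
const synonyms = {
  ipod: ['ipod', 'i pod', 'i-pod'],
  get 'i pod'() { return this.ipod },
  get 'i-pod'() { return this.ipod },
}

// Hypothetical helper: try the two-word phrase first, then the single word.
function lookupWords(text, table) {
  const has = key => Object.prototype.hasOwnProperty.call(table, key)
  const words = text.split(' ')
  const out = []
  for (let i = 0; i < words.length; i++) {
    const pair = words[i] + ' ' + words[i + 1]
    if (i + 1 < words.length && has(pair)) {
      out.push(table[pair])
      i++ // skip the second word of the matched phrase
    } else if (has(words[i])) {
      out.push(table[words[i]])
    } else {
      out.push(words[i])
    }
  }
  return out
}

console.log(JSON.stringify(lookupWords('cool i pod', synonyms)))
// ["cool",["ipod","i pod","i-pod"]]
```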

Back of the napkin architecture here. :) The issue I see with this though (at least client side) is that the synonyms are an 8 meg file. It'd be cool if this were part of the compromise repo, as a separate package that you can include and pass to compromise.

spencermountain commented 6 years ago

yeah! very cool @buildbreakdo. another thing that would be cool about using WordNet is that you can ensure the Part-of-Speech matches on the term before making a swap. That will prevent errors like: when i'm really bored, i pod the .. - 😕

because compromise can reliably conjugate verbs to infinitive, and swap plurals back to singular, a synonym switcher could do this:

nlp('i walked ecstatically').replace({ecstatic:'happy'}).out()
//i walked happily

you know? I haven't done this for perf reasons, but you could imagine building a clever method to do this - happy to help

flesler commented 6 years ago

If it could do that, it'd be INCREDIBLY cool. Imagine something like nlp('...').paraphrase(). 💯

owendall commented 6 years ago

👍

owendall commented 6 years ago

@spencermountain Not sure how best to help do what you mentioned above...

image

Obviously not implemented. :-)

spencermountain commented 6 years ago

hey owen, it would involve looping through each word and conjugating all the verbs to infinitive, and all the plural nouns to singular.

If you save that string on each term, the replace method could just loop around and look at that string.

I don't wanna do that at tag-time. It would make everything slow.

spencermountain commented 6 years ago

but it would be a wicked plugin: one method, doc.cache(), to create this 'super-normalized' word, and another method to doc.replace({ecstatic:'happy'})
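
A very rough sketch of that two-step idea, in plain JavaScript: a cache step stores a root form on each term once, and a replace step only looks at the stored root. Everything here is hypothetical (cacheRoots, replaceByRoot, and the tiny ROOTS table are stand-ins, not compromise methods); a real plugin would ask compromise for the infinitive or singular form instead of using a hand-written table.

```javascript
// Toy stand-in for conjugating verbs to infinitive and plurals to singular.
// Assumption: a hand-written table, used here only to keep the sketch runnable.
const ROOTS = new Map([
  ['walked', 'walk'],
  ['walking', 'walk'],
  ['walks', 'walk'],
  ['dogs', 'dog'],
])

// "cache" step: compute the root once per term and store it next to the text
function cacheRoots(text) {
  return text.split(' ').map(word => {
    const lower = word.toLowerCase()
    return { text: word, root: ROOTS.get(lower) ?? lower }
  })
}

// "replace" step: compare against the stored root only, no re-analysis
function replaceByRoot(terms, swaps) {
  const has = key => Object.prototype.hasOwnProperty.call(swaps, key)
  return terms.map(term => (has(term.root) ? swaps[term.root] : term.text)).join(' ')
}

const terms = cacheRoots('she walked the dogs')
console.log(replaceByRoot(terms, { walk: 'stroll', dog: 'cat' }))
// she stroll the cat
```

Note that the output is not re-inflected ("stroll" rather than "strolled"); putting the tense and plural back onto the replacement is the part that would need compromise's conjugation.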

giorgio79 commented 6 years ago

I would not call "bc" a synonym for "because", but more a slang version, or a contracted form. Just my 2 cents.