Open flesler opened 6 years ago
This could make things like #Currency more useful. Being able to normalize to either the name or the symbol would be great (unless it can already be done and I missed it).
love this idea
I temporarily implemented this myself, adding the following to the "plugin":
synonyms: {
  u: 'you',
  ya: 'you',
  bc: 'because',
  r: 'are',
  sth: 'something',
  pls: 'please',
  sry: 'sorry',
  '&': 'and',
  okay: 'ok',
  congrats: 'congratulations',
  congratz: 'congratulations',
}
Then I iterate and replace(key, synonyms[key])
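For anyone wanting to try the same thing, the iterate-and-replace step might look roughly like the sketch below. It assumes whitespace tokenization and case-insensitive whole-token matching, neither of which is stated in the comment, and normalizeSlang is a hypothetical helper name, not part of compromise.

```javascript
// Subset of the synonyms table from the comment above.
const synonyms = {
  u: 'you',
  bc: 'because',
  r: 'are',
  pls: 'please',
  '&': 'and',
  congrats: 'congratulations',
};

// Hypothetical helper (not a compromise API): swaps whole tokens only,
// so the "r" inside "sorry" is left alone.
function normalizeSlang(text, table) {
  return text
    .split(' ')
    .map(token => {
      const key = token.toLowerCase();
      // Own-property check avoids accidental hits on inherited keys like "constructor".
      return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : token;
    })
    .join(' ');
}

console.log(normalizeSlang('u r late & sorry', synonyms));
// you are late and sorry
```

Matching whole tokens rather than calling a plain string replace per key avoids rewriting letters inside longer words, at the cost of missing tokens with punctuation attached.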
If this helps anyone get started: I pulled synonyms out of WordNet a while back. They are in WordNet format, which I don't particularly like. The Elasticsearch docs have a good discussion of synonyms here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms
I think Solr-style synonyms are much more legible:
Solr:
"synonyms" : [
  "i-pod, i pod => ipod",
  "universe, cosmos"
]
WordNet:
"synonyms" : [
  "s(100000001,1,'abstain',v,1,0).",
  "s(100000001,2,'refrain',v,1,0).",
  "s(100000001,3,'desist',v,1,0)."
]
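To show why the Solr-style rules are easy to work with, here is a minimal sketch that turns them into a lookup table. parseSolrSynonyms is a hypothetical name, and collapsing a plain comma list such as "universe, cosmos" onto its first term is an assumption made for normalization purposes; it is not necessarily how Solr or Elasticsearch treat such lists.

```javascript
// Hypothetical parser for the Solr-style rules shown above.
// Assumption: for a rule without "=>", every term maps to the first term listed.
function parseSolrSynonyms(rules) {
  const lookup = new Map();
  for (const rule of rules) {
    const [left, right] = rule.split('=>').map(side => side.trim());
    const terms = left.split(',').map(term => term.trim()).filter(Boolean);
    // An explicit mapping uses the right-hand side; otherwise fall back to the first term.
    const target = right !== undefined ? right : terms[0];
    for (const term of terms) lookup.set(term, target);
  }
  return lookup;
}

const lookup = parseSolrSynonyms(['i-pod, i pod => ipod', 'universe, cosmos']);
console.log(lookup.get('i pod'));  // ipod
console.log(lookup.get('cosmos')); // universe
```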
Could skip all that and do something clever with getters and setters like:
const synonyms = {
  "ipod": ["ipod", "i pod", "i-pod"],
  get "i pod"() { return this["ipod"] },
  get "i-pod"() { return this["ipod"] },
  // ...
}
Every word variant maps to the root word, which holds all variants including itself. Used on text like "cool i-pod":
"cool i-pod".split(' ').map(word => synonyms[word] || word)
would output:
["cool", ["ipod", "i pod", "i-pod"]]
Back-of-the-napkin architecture here. :) The issue I see with this though (at least client side) is that the synonyms file is 8 MB. It would be cool if this were part of the compromise repo, but as a separate package that you can include and pass to compromise?
yeah! very cool @buildbreakdo
another thing that would be cool about using WordNet is that you can ensure the part of speech matches on the term before making a swap. That will prevent errors like "when i'm really bored, i pod the .." 😕
because compromise can reliably conjugate verbs to infinitive, and swap plurals back to singular, a synonym switcher could do this:
nlp('i walked ecstatically').replace({ecstatic:'happy'}).out()
//i walked happily
you know? I haven't done this for perf reasons, but you could imagine building a clever method to do this - happy to help
If it could do that, it'd be INCREDIBLY cool. Imagine something like nlp('...').paraphrase(). 💯
👍
@spencermountain Not sure how best to help do what you mentioned above...
Obviously not implemented. :-)
hey owen, it would involve looping through each word and conjugating all the verbs to infinitive, and all the plural nouns to singular.
If you save that string on each term, the replace method could just loop around and look at that string.
I don't wanna do that at tag-time. It would make everything slow.
but it would be a wicked plugin.
one method, doc.cache(), to create this 'super-normalized' word, and another method to doc.replace({ecstatic:'happy'})
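The two-step idea (cache a root form on each term, then replace by looking at that cached root) could be sketched in plain JavaScript as below. This does not use compromise: ROOTS is a tiny hand-written stand-in for real conjugation and singularization, and cache and replaceByRoot are hypothetical names rather than existing methods.

```javascript
// Hypothetical stand-in for "verbs to infinitive, plurals to singular".
const ROOTS = { walked: 'walk', walks: 'walk', dogs: 'dog', cats: 'cat' };

// Step 1: compute the 'super-normalized' root once and store it on each term.
function cache(text) {
  return text.split(/\s+/).filter(Boolean).map(word => {
    const lower = word.toLowerCase();
    return { text: word, root: ROOTS[lower] || lower };
  });
}

// Step 2: replace by comparing against the cached root, not the surface text.
function replaceByRoot(terms, swaps) {
  return terms
    .map(term =>
      Object.prototype.hasOwnProperty.call(swaps, term.root) ? swaps[term.root] : term.text
    )
    .join(' ');
}

const terms = cache('she walked two dogs');
console.log(replaceByRoot(terms, { walk: 'stroll', dog: 'puppy' }));
// she stroll two puppy
```

The output shows what this sketch leaves out: the replacement is not re-inflected back to the original tense or number, which is the part a real plugin would need to handle.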
I would not call "bc" a synonym for "because", but more a slang version, or a contracted form. Just my 2 cents.
It'd be great, when normalizing text, to include synonyms (and even antonyms!), so that all synonyms are normalized to the same word (and antonyms to not ${antonym} too). It shouldn't be incredibly complex; it's equivalent to using replace(synonym, normalized) for each case (but much more optimized, I hope).
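One way this could work without running replace() once per synonym is to build a single lookup up front. The sketch below is only an illustration: the group format, the sample words, and the function names are all made up for this example and are not part of compromise.

```javascript
// Hypothetical group format: one canonical word plus its synonyms and antonyms.
const groups = [
  { word: 'happy', synonyms: ['glad', 'cheerful'], antonyms: ['sad', 'unhappy'] },
];

// Build one lookup so each token costs a single Map read.
function buildLookup(groupList) {
  const lookup = new Map();
  for (const { word, synonyms = [], antonyms = [] } of groupList) {
    for (const s of synonyms) lookup.set(s, word);
    for (const a of antonyms) lookup.set(a, `not ${word}`);
  }
  return lookup;
}

// Whitespace tokenization is an assumption; real text needs proper term handling.
function normalize(text, lookup) {
  return text
    .split(' ')
    .map(token => lookup.get(token.toLowerCase()) || token)
    .join(' ');
}

console.log(normalize('she was glad but he was sad', buildLookup(groups)));
// she was happy but he was not happy
```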