Open recski opened 10 years ago
@juditacs FYI similarities that exist for some token (not type):
1125675 both
187855 lsa
282376 machine
1143555 neither
i.e.: 41.7% of tokens are OOV in both sims and 59% are OOV in at least one, that's a LOT
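For reference, a quick way to re-derive those percentages from the four counts. This is only a sketch: it assumes "both" means both similarities exist for the token and "neither" means the token is OOV in both; the function name `oov_rates` is made up for illustration.

```python
# Token counts copied from the comment above.
counts = {"both": 1125675, "lsa": 187855, "machine": 282376, "neither": 1143555}

def oov_rates(counts):
    """Return (share OOV in both sims, share OOV in at least one sim)."""
    total = sum(counts.values())
    # Assumption: "neither" = no similarity available from either resource.
    oov_in_both = counts["neither"] / total
    # Everything except "both" is missing from at least one resource.
    oov_in_at_least_one = 1 - counts["both"] / total
    return oov_in_both, oov_in_at_least_one

oov_both, oov_any = oov_rates(counts)
print(f"{oov_both:.1%} OOV in both, {oov_any:.1%} OOV in at least one")
```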
@juditacs this file lists all OOVs; you can grep for 'ls' to see which ones were OOV for your system: /home/recski/projects/sts/semeval/stats/oov_stats_141120.txt
There are also sequences of words in it. I'm not sure they are supposed to reach the word_similarity function at all, e.g.:
2014-11-20 18:05:43,348 : align_and_penalize (526) - WARNING - OOV: (u'real number',), no lsa similarity
@juditacs for spell correction I plan to do the following for now: for OOVs, call hunspell and try using the suggestions it returns. Of course this may turn out to be too greedy in the end, we'll see...

For NE spelling variants, etc. that are likely to come up in the Twitter data, Attila suggests that we build a spell corrector from the mentions data, using this code: https://github.com/zseder/hunmisc/tree/master/hunmisc/freebasealtnames

I think you should decide whether this is really a source of error on the Twitter dataset when you inspect it manually. Let me know what you think.
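Roughly what I have in mind for the OOV fallback, as a sketch only. All names here are hypothetical: `sim`, `in_vocab` and `suggest` are passed in as plain callables (`suggest` would wrap the hunspell call), and taking the maximum over suggestions is exactly the part that may prove too greedy.

```python
def similarity_with_fallback(w1, w2, sim, in_vocab, suggest, max_suggestions=3):
    """Similarity of w1 and w2, replacing OOV words by spelling suggestions.

    sim(a, b)   -> float, defined for in-vocabulary words (assumed)
    in_vocab(w) -> bool
    suggest(w)  -> list of suggested spellings, best first (assumed)
    Returns None if no in-vocabulary candidate pair exists.
    """
    def candidates(word):
        if in_vocab(word):
            return [word]
        # Keep only the top suggestions that the similarity resource knows.
        return [s for s in suggest(word)[:max_suggestions] if in_vocab(s)]

    scores = [sim(a, b) for a in candidates(w1) for b in candidates(w2)]
    # Greedy choice: take the best-scoring candidate pair.
    return max(scores) if scores else None
```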
IMHO hunspell may be overkill for our purposes. If many OOVs remain, I suggest implementing the Norvig spell checker (about 20 lines of code).
So far I have only encountered one NE "altname", I'll look into the freebasealtnames if I find more.
I'm nearly done, but the first runs suggest that you're right: hardly any words are affected, so I'll consider this low priority.
Ok, noted. Maybe we should try a simple edit-distance based spell checker.
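Something along these lines, in the spirit of the Norvig checker mentioned earlier. This is a sketch, not his exact code: it only looks at edit distance 1, and `vocab_counts` is a hypothetical dict mapping known words to frequencies.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1 from word (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab_counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in vocab_counts:
        return word
    candidates = [w for w in edits1(word) if w in vocab_counts]
    if not candidates:
        return word  # nothing within one edit: leave the word unchanged
    return max(candidates, key=vocab_counts.get)
```

With `vocab_counts = {"similarity": 5}`, `correct("simlarity", vocab_counts)` returns `"similarity"`; a word with no known neighbour within one edit comes back unchanged.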
lemmatization performed by Ocamorph+Hundisambig seems to cause a lot of issues: e.g. grassy is analyzed as grassy
"sledghammer" doesn't stand a chance