juditacs / semeval

MathLing Budapest Team's repo
MIT License

way too many OOVs - need spell correction, etc. #6

Open recski opened 9 years ago

recski commented 9 years ago

"sledghammer" doesn't stand a chance

recski commented 9 years ago

@juditacs FYI, counts of which similarities exist, per token (not per type):

1125675 both
187855 lsa
282376 machine
1143555 neither

i.e.: 41.7% of tokens are OOV in both sims and 59% are OOV in at least one, that's a LOT
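For reference, the percentages can be re-derived from the four counts. This is only a quick check; it assumes "lsa" and "machine" mean that only that one similarity exists for the token, and "neither" means no similarity at all.

```python
# Token counts copied from the comment above; the keys name which similarity exists.
# Assumption: "lsa" / "machine" = only that similarity exists for the token.
counts = {"both": 1125675, "lsa": 187855, "machine": 282376, "neither": 1143555}

total = sum(counts.values())
oov_in_both = counts["neither"] / total        # token has no similarity at all
oov_in_any = (total - counts["both"]) / total  # token is missing at least one

print(f"OOV in both sims:    {oov_in_both:.1%}")
print(f"OOV in at least one: {oov_in_any:.1%}")
```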

recski commented 9 years ago

@juditacs this file has all oov-s, you can grep for 'ls' to see which ones were oov for your system: /home/recski/projects/sts/semeval/stats/oov_stats_141120.txt

there are also sequences of words, I'm not sure if they are supposed to get to the word_similarity function, e.g.:

2014-11-20 18:05:43,348 : align_and_penalize (526) - WARNING - OOV: (u'real number',), no lsa similarity
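A rough sketch for finding such multi-word OOVs in the logs. The regex is modelled on the single WARNING line quoted above, so other log lines may well need a different pattern; `parse_oov_line` and `is_multiword` are hypothetical helper names, not functions from the repo.

```python
import ast
import re

# Pattern modelled on the one WARNING line quoted above (an assumption).
OOV_RE = re.compile(r"WARNING - OOV: (\(.*\)), no (\w+) similarity")


def parse_oov_line(line):
    """Return (tokens, sim_name) for an OOV warning line, or None if it does not match."""
    match = OOV_RE.search(line)
    if match is None:
        return None
    # The tuple is printed as a Python literal; u'...' also parses in Python 3.
    tokens = ast.literal_eval(match.group(1))
    return tokens, match.group(2)


def is_multiword(tokens):
    """True if any logged OOV item contains a space, i.e. is a word sequence."""
    return any(" " in token for token in tokens)
```

Running `parse_oov_line` over a log file and keeping the lines where `is_multiword` is true would list the sequences that reach the similarity lookup.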

recski commented 9 years ago

@juditacs for spell correction I plan to do the following for now: for OOVs, call hunspell and try using the suggestions it returns. Of course this may turn out to be too greedy in the end, we'll see...

For NE spelling variants, etc. that are likely to come up in the Twitter data, Attila suggests that we build a spell corrector from the mentions data, using this code: https://github.com/zseder/hunmisc/tree/master/hunmisc/freebasealtnames

I think you should decide whether this is really a source of error on the Twitter dataset when you inspect it manually. Let me know what you think.
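A minimal sketch of the suggestion fallback described above. It does not call hunspell itself: `suggest` stands for any callable that returns a list of candidate corrections (for example a thin wrapper around a hunspell binding), and `resolve_oov` and `vocab` are hypothetical names used only for illustration.

```python
def resolve_oov(word, vocab, suggest):
    """Return `word` if it is in `vocab`, else the first in-vocabulary suggestion, else None.

    `suggest` is any callable mapping a word to a list of candidate corrections;
    here it is a stand-in for the spell checker and is an assumption of this sketch.
    """
    if word in vocab:
        return word
    for candidate in suggest(word):
        # Taking the first usable suggestion is the "greedy" part.
        if candidate in vocab:
            return candidate
    return None
```

Returning `None` rather than the original word lets the caller still count the token as OOV when no suggestion is covered by the similarity resource.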

juditacs commented 9 years ago

IMHO hunspell may be overkill for our purposes. If many OOVs remain, I suggest implementing the Norvig spell checker (about 20 lines of code).

So far I have only encountered one NE "altname", I'll look into the freebasealtnames if I find more.
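For context, a compact sketch in the style of Norvig's spell corrector: generate all strings within one or two edits and pick the most frequent known one. The word-frequency table `freq` is an assumed input (for example counts over the vocabulary of the similarity resource); this is an illustration, not code from the repo.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, freq):
    """Most frequent known word within two edits of `word`, or `word` itself."""
    def known(words):
        return {w for w in words if w in freq}

    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: freq.get(w, 0))
```

With `freq = Counter({"sledgehammer": 3, "hammer": 10})`, `correct("sledghammer", freq)` finds "sledgehammer" at one insertion.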

recski commented 9 years ago

I'm nearly done, but the first runs suggest that you're right: hardly any words are affected, so I'll consider this low priority.


juditacs commented 9 years ago

Ok, noted. Maybe we should try a simple edit-distance based spell checker.
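One way such an edit-distance checker could look, as a sketch only: compute Levenshtein distance to every vocabulary word and accept the nearest one within a threshold. The names `closest` and `max_dist` and the default threshold of 2 are assumptions; a linear scan over the vocabulary is slow for large word lists.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def closest(word, vocab, max_dist=2):
    """Nearest vocabulary word within `max_dist` edits, or None if there is none."""
    # Ties on distance are broken alphabetically to keep the result deterministic.
    best = min(vocab, key=lambda v: (levenshtein(word, v), v), default=None)
    if best is None or levenshtein(word, best) > max_dist:
        return None
    return best
```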

recski commented 9 years ago

lemmatization performed by Ocamorph+Hundisambig seems to cause a lot of issues: e.g. grassy is analyzed as grassy