<unk> in input vocab - Githubissues

stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

Apache License 2.0

6.86k stars 1.51k forks source link

<unk> in input vocab #32

Closed DavidNemeskey closed 8 years ago

DavidNemeskey commented 8 years ago

Lines 225-226 of glove.c go like this:

// input vocab cannot contain special <unk> keyword
if (strcmp(word, "<unk>") == 0) return 1;

However, later:

if (use_unk_vec) {
  ...
}

I think this is wrong. If the input vocabulary contains the <unk> token, after potentially hours of waiting for glove to finish, the user ends up with a broken text file. The binary vectors will be intact IF the user asked for them, but even then, he might prefer text.

I see two ways of solving this problem:

Allow the user to have <unk> in the input vocabulary -- maybe by providing a switch or just

if (strcmp(word, "<unk>") == 0) use_unk_vec = 0;

If <unk> is not allowed, just warn the user after reading the vocabulary (or even in vocab-count)

ghost commented 8 years ago

I think your second solution

If is not allowed, just warn the user after reading the vocabulary (or even in vocab-count)

is a great idea. Exiting the program with a failed status and a message explaining the error would be a better behavior. Do you mind drafting a pull request with your suggested changes?

DavidNemeskey commented 8 years ago

No, of course not, though actually I would have voted for the first solution. I was just wondering if there was a reason why not allow <unk>. In my use-case, I needed explicit <unk>(s).

ghost commented 8 years ago

Well, let's see. Automatically swapping out the glove unk for the user unk iff a user unk exists is weird behavior because it's unpredictable, especially if you don't know exactly what's in the corpus. My expectation was that the user could pretty easily modify their corpus to swap out unk for another symbol of their choosing and then they don't have to go fiddling around with the glove internals to get their setup working.

So for example, the error message could read:

 'Error, <unk> vector found in corpus. Please remove <unk>s from your corpus (e.g. cat text8 | sed -e 's/<unk>/<raw_unk>/g' > text8.new)'

I think this is ultimately pretty easy for people to understand and execute on, so they won't have the same painful experience that you did. The other option is of course to let people opt in to using their own unks, but the error is thrown in glove.c and the warring in another file, so coordinating those two things for the process of opting in is messy. There are a few other ideas I had in mind, but they are all ugly as well.

ghost commented 8 years ago

Okay, I went ahead and added the warning, with the advised workaround. Thanks for reporting. See https://github.com/stanfordnlp/GloVe/commit/ce2285550c22460585eb433ad03c895941362762.