Closed DavidNemeskey closed 8 years ago
I think your second solution
If
is not allowed, just warn the user after reading the vocabulary (or even in vocab-count)
is a great idea. Exiting the program with a failed status and a message explaining the error would be a better behavior. Do you mind drafting a pull request with your suggested changes?
No, of course not, though actually I would have voted for the first solution. I was just wondering if there was a reason why not allow <unk>
. In my use-case, I needed explicit <unk>
(s).
Well, let's see. Automatically swapping out the glove unk for the user unk iff a user unk exists is weird behavior because it's unpredictable, especially if you don't know exactly what's in the corpus. My expectation was that the user could pretty easily modify their corpus to swap out unk for another symbol of their choosing and then they don't have to go fiddling around with the glove internals to get their setup working.
So for example, the error message could read:
'Error, <unk> vector found in corpus. Please remove <unk>s from your corpus (e.g. cat text8 | sed -e 's/<unk>/<raw_unk>/g' > text8.new)'
I think this is ultimately pretty easy for people to understand and execute on, so they won't have the same painful experience that you did. The other option is of course to let people opt in to using their own unks, but the error is thrown in glove.c and the warring in another file, so coordinating those two things for the process of opting in is messy. There are a few other ideas I had in mind, but they are all ugly as well.
Okay, I went ahead and added the warning, with the advised workaround. Thanks for reporting. See https://github.com/stanfordnlp/GloVe/commit/ce2285550c22460585eb433ad03c895941362762.
Lines 225-226 of glove.c go like this:
However, later:
I think this is wrong. If the input vocabulary contains the
<unk>
token, after potentially hours of waiting for glove to finish, the user ends up with a broken text file. The binary vectors will be intact IF the user asked for them, but even then, he might prefer text.I see two ways of solving this problem:
<unk>
in the input vocabulary -- maybe by providing a switch or just<unk>
is not allowed, just warn the user after reading the vocabulary (or even in vocab-count)