Closed: vinbo8 closed this 4 years ago
Thanks! Let me first test it on real data, and then I will merge.
Nope, it throws errors at training time and fails:

```
in _build_word_char_embeddings
(1) Invalid argument: indices[21,1,1] = 23439 is not in [0, 261)
  [[{{node lm/embedding_lookup}}]]
  [[lm_1/gradients/lm_1/sampled_softmax_loss_1/embedding_lookup_grad/Cast/_337]]
```
Trying to find a number of characters which would cover everything we need, while still being far smaller than the total number of Unicode code points (more than a million).
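For reference, a quick way to check how many distinct characters a corpus actually contains (the path is a placeholder; this is only a sketch, not part of the PR):

```python
# Sketch: count the distinct characters in a training corpus, to see how
# large the character inventory actually is compared to the ~1.1M possible
# Unicode code points.
from collections import Counter

char_counts = Counter()
with open("corpus.txt", encoding="utf-8") as corpus:  # placeholder path
    for line in corpus:
        char_counts.update(line.rstrip("\n"))

print(f"distinct characters: {len(char_counts)}")
print("most frequent:", char_counts.most_common(20))
```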
OK, after increasing the `n_characters` parameter in `train_elmo.py` to 256000 (from 261), it is training without errors. But the process is about 3 times slower than usual.
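For anyone reproducing this: the change amounts to bumping `n_characters` inside the `char_cnn` block of the options dict in `train_elmo.py`. The sketch below shows where it lives; the surrounding values are the usual defaults and should be treated as assumptions about your setup.

```python
# Relevant excerpt of the options dict in train_elmo.py (other options
# such as the LSTM settings are omitted; the defaults shown are assumptions).
options = {
    'char_cnn': {
        'activation': 'relu',
        'embedding': {'dim': 16},
        'filters': [[1, 32], [2, 32], [3, 64], [4, 128],
                    [5, 256], [6, 512], [7, 1024]],
        'max_characters_per_token': 50,
        'n_characters': 256000,  # bumped from the default 261
        'n_highway': 2,
    },
    # ... bidirectional, lstm, dropout, n_tokens_vocab etc. ...
}
```

With 256000 ids the character embedding table is roughly a thousand times larger than before, which presumably accounts for part of the slowdown.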
After it finishes, I will compare the performance of this model with the one trained in the regular way.
Maybe it's a good idea to just build a character vocabulary...
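Roughly what I mean (an untested sketch; the reserved ids and the min-count threshold are assumptions, not something from this codebase): map every character observed in the corpus to a small contiguous id range, keeping a few ids for special symbols.

```python
# Sketch: build a compact character vocabulary from the corpus instead of
# reserving an id for every possible code point.
from collections import Counter

N_RESERVED = 5  # e.g. padding, begin/end of word, begin/end of sentence

def build_char_vocab(corpus_path, min_count=1):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            counts.update(line.rstrip("\n"))
    chars = [c for c, n in counts.most_common() if n >= min_count]
    # ids 0 .. N_RESERVED-1 stay free for the special symbols
    return {c: i + N_RESERVED for i, c in enumerate(chars)}

char2id = build_char_vocab("corpus.txt")  # placeholder path
print(f"character vocabulary size: {len(char2id) + N_RESERVED}")
```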
OK, I tested a model trained this way. It trains much longer than usual and ends up marginally larger in size (not surprisingly). But most importantly, the resulting ELMo performs consistently worse, at least on the WSD task (I tested on two different languages). Thus, I don't think it makes sense to mess with the original code in this way. Instead, I've committed a simple script which analyzes a prospective training corpus by calculating the average number of bytes per character. It also predicts the number of word types and word tokens which will be cropped at training time.
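The numbers that script reports can be approximated along these lines (a minimal sketch, assuming whitespace tokenization, UTF-8 byte counting, and the default `max_characters_per_token` of 50; the committed script and the exact cropping logic in bilm-tf may differ, e.g. because of the begin/end-of-word markers):

```python
# Sketch: estimate the average number of bytes per character and how many
# word types / tokens would be cropped by the per-token character limit.
from collections import Counter

MAX_CHARS_PER_TOKEN = 50  # assumed default limit from the training options

def analyze_corpus(corpus_path):
    n_chars = n_bytes = n_tokens = cropped_tokens = 0
    type_counts = Counter()
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            for token in line.split():
                type_counts[token] += 1
                n_tokens += 1
                token_bytes = len(token.encode("utf-8"))
                n_chars += len(token)
                n_bytes += token_bytes
                if token_bytes > MAX_CHARS_PER_TOKEN:
                    cropped_tokens += 1
    cropped_types = sum(
        1 for t in type_counts if len(t.encode("utf-8")) > MAX_CHARS_PER_TOKEN
    )
    print(f"average bytes per character: {n_bytes / max(n_chars, 1):.2f}")
    print(f"word types cropped:  {cropped_types} / {len(type_counts)}")
    print(f"word tokens cropped: {cropped_tokens} / {n_tokens}")

analyze_corpus("corpus.txt")  # placeholder path
```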
I know this is `master`, but I think it should behave the same as it does now for Latin.