ltgoslo / simple_elmo_training

Minimal code to train ELMo models in recent versions of TensorFlow

Use > 256 code points #2

Closed: vinbo8 closed this issue 4 years ago

vinbo8 commented 4 years ago

I know this is against master, but I think it should behave the same as it does now for Latin.

akutuzov commented 4 years ago

Thanks! Let me first test it on real data, and then I will merge.

akutuzov commented 4 years ago

Nope, it throws errors at training time and fails:

in _build_word_char_embeddings
  (1) Invalid argument: indices[21,1,1] = 23439 is not in [0, 261)
         [[{{node lm/embedding_lookup}}]]
         [[lm_1/gradients/lm_1/sampled_softmax_loss_1/embedding_lookup_grad/Cast/_337]]

I'm trying to find a number of characters which would cover everything we need, but would still be well below the total number of Unicode code points (more than a million).
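For reference, a quick way to estimate how many distinct code points a corpus actually uses (and how large the largest id would be) might look like the sketch below; the corpus path is just a placeholder:

    # Count distinct Unicode code points in a corpus and report the largest one,
    # to get a feel for how big the character embedding table would have to be
    # if code points were used as ids directly. The path is a placeholder.
    from collections import Counter

    corpus_path = "corpus.txt"  # hypothetical path

    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line)

    print("distinct code points:", len(counts))
    print("max code point:", max(ord(c) for c in counts))
    print("most frequent characters:", counts.most_common(20))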

akutuzov commented 4 years ago

OK, after increasing the n_characters parameter in train_elmo.py to 256000 (from 261), it trains without errors. But the process is about 3 times slower than usual. After it finishes, I will compare the performance of this model with one trained in the regular way.
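For anyone following along, a minimal sketch of that change, assuming the options dict in train_elmo.py follows the original bilm-tf layout (the exact keys may differ in this repository):

    # Sketch only: the relevant field sits inside the char_cnn part of the
    # options dict in the original bilm-tf layout; other settings unchanged.
    options = {
        "bidirectional": True,
        "char_cnn": {
            "max_characters_per_token": 50,
            "n_characters": 256000,  # was 261: presumably 256 byte values plus a few special ids
            # ... remaining char_cnn settings (filters, embedding dim, etc.) unchanged
        },
        # ... rest of the options unchanged
    }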

vinbo8 commented 4 years ago

Maybe it's a good idea to just build a character vocabulary.
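Purely as an illustration of that idea (not code from this PR; the names and size limit are made up), such a vocabulary could map the most frequent characters of the corpus to a compact id space and send everything else to a single unknown id:

    # Illustrative sketch of a character vocabulary: the N most frequent
    # characters get compact ids, everything else maps to an unknown id.
    from collections import Counter

    def build_char_vocab(corpus_path, max_chars=10000):
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(line)
        char2id = {"<UNK>": 0}  # reserve id 0 for unknown / rare characters
        for ch, _ in counts.most_common(max_chars - 1):
            char2id[ch] = len(char2id)
        return char2id

    def encode(text, char2id):
        return [char2id.get(ch, 0) for ch in text]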

akutuzov commented 4 years ago

OK, I tested a model trained this way. It takes much longer to train than usual and ends up marginally larger in size (not surprisingly). But most importantly, the resulting ELMo performs consistently worse, at least on the WSD task (I tested on two different languages). Thus, I don't think it makes sense to mess with the original code in this way. Instead, I've committed a simple script which analyzes a prospective training corpus by calculating the average number of bytes per character. It also predicts the number of word types and word tokens which will be cropped at training time.
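The gist of such a check can be sketched as follows; this is an illustration of the idea, not the committed script itself, and it approximates cropping by comparing a token's UTF-8 byte length against the per-token character limit (50 being a common default for max_characters_per_token):

    # Rough illustration: measure the average number of UTF-8 bytes per
    # character and estimate how many word types/tokens exceed the per-token
    # character limit and would therefore be cropped. Not the committed script.
    def analyze(corpus_path, max_chars_per_token=50):
        n_chars = n_bytes = n_tokens = cropped_tokens = 0
        types, cropped_types = set(), set()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                for token in line.split():
                    n_tokens += 1
                    types.add(token)
                    n_chars += len(token)
                    byte_len = len(token.encode("utf-8"))
                    n_bytes += byte_len
                    if byte_len > max_chars_per_token:
                        cropped_tokens += 1
                        cropped_types.add(token)
        print("average bytes per character:", n_bytes / n_chars)
        print("word tokens to be cropped:", cropped_tokens, "of", n_tokens)
        print("word types to be cropped:", len(cropped_types), "of", len(types))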