Minimal code to train ELMo models in TensorFlow.
Heavily based on https://github.com/allenai/bilm-tf .
Most changes are simplifications and updates for recent versions of TensorFlow 1. See also our repository with simple code for inferring contextualized word vectors from pre-trained ELMo models. To train a model, run:
python3 bilm/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT
where
$DATA
is a path to the directory containing 2 or more of (possibly gzipped) plain text files: your training corpus.
$SIZE
is the number of word tokens in $DATA (needed to construct and log batches correctly).
$VOCAB
is a (possibly gzipped) one-word-per-line vocabulary file to be used for language modeling; it should always contain at least <S>, </S> and <UNK>.
$OUT
is a directory where the TensorFlow checkpoints will be saved.
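The $SIZE value can be computed directly from the corpus. The sketch below, a minimal helper not part of this repository, counts whitespace-separated tokens across all (possibly gzipped) plain-text files in a directory; the function name and the glob pattern are illustrative assumptions.

```python
import glob
import gzip
import os

def count_tokens(data_dir):
    """Count whitespace-separated word tokens across all (possibly
    gzipped) plain-text files in data_dir -- the value for --size."""
    total = 0
    for path in glob.glob(os.path.join(data_dir, "*")):
        # gzipped files are opened transparently in text mode
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total
```

The result can then be passed to train_elmo.py as the --size argument.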
Before training, please review the settings in bilm/train_elmo.py.
After training, use the bilm/dump_weights.py script to convert the checkpoints to an HDF5 model:
python3 bilm/dump_weights.py --save_dir $MODEL_DIR --outfile $MODEL_DIR/model.hdf5
Save your vocabulary file in the same directory.
Change the n_characters value in the options.json file from 261 to 262 to use the saved model for inference.
More details at https://github.com/allenai/bilm-tf