Right now, to test a new representation learner, one must:
1. Use the representation learner to write hidden-state vectors for each token (or subword) to disk. (A better idea for subword models: decide how to combine subword representations, then write the resulting token embeddings to disk.)
2. Run the structural probe code on the hidden states saved to disk. (A rough sketch of this pipeline follows.)
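As an illustration of the current workflow, here is a minimal sketch of step 1, assuming the Hugging Face `transformers` library, HDF5 as the on-disk format, and one dataset per sentence keyed by sentence index; the model name, file names, and layer handling are assumptions for illustration, not necessarily what the existing scripts do.

```python
# Sketch of step 1: dump per-token hidden states to disk, one HDF5 dataset
# per sentence, keyed by sentence index. (Illustrative only; assumes the
# Hugging Face `transformers` and `h5py` packages.)
import h5py
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-cased')
model = BertModel.from_pretrained('bert-large-cased', output_hidden_states=True)
model.eval()

sentences = ["The probe is trained on parse distances ."]  # pre-tokenized text

with h5py.File('bert_large_hidden_states.hdf5', 'w') as fout:
    for index, sentence in enumerate(sentences):
        encoded = tokenizer(sentence, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**encoded)
        # hidden_states: tuple of (num_layers + 1) tensors, each (1, n_subwords, dim)
        hidden = torch.stack(outputs.hidden_states, dim=0).squeeze(1)
        fout.create_dataset(str(index), data=hidden.numpy())  # (layers+1, n_subwords, dim)
```

Step 2 would then read these per-sentence datasets back from disk at probe-training time instead of calling BERT.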
This is "nice" in that the hidden states don't need to be computed at each pass (BERT is big/slow; I actually run most experiments on CPUs because the probe training is so fast and CPUs are so plentiful)
However, it's "not nice" that one can't swap representation model parameters on the fly, and especially that big huge vectors take up a lot of disk space (115GB for BERT-large on PTB WSJ train -- 40k sents.)
We'd like to enable easy swapping-in of new models by defining a new class in model.py. We'll need to read in each model's tokenizer (and perhaps subword tokenizer) so that we pass the model words as identified by its own vocabulary, and so that we can map subword representations back to token representations. There's also the inefficiency of performing this subword-to-token alignment on every batch. (A rough sketch of such a class is given below.)
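Below is a hypothetical sketch of what such a class could look like; the class name, constructor arguments, and the choice to average subword vectors per word are assumptions for illustration, not the repo's current model.py interface.

```python
# Hypothetical sketch of a swappable model class for model.py (names and the
# interface are assumptions, not the repo's current API).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class OnTheFlyBertModel(nn.Module):
    """Computes hidden states on the fly and averages subword reprs per token."""

    def __init__(self, model_name='bert-large-cased', layer=16):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        self.layer = layer  # which hidden layer the probe sees

    def forward(self, sentences):
        """sentences: list of lists of word tokens (the probe's tokenization)."""
        token_reprs = []
        for words in sentences:
            # Tokenize word-by-word so we know which subwords belong to which word.
            subword_ids, word_spans = [self.tokenizer.cls_token_id], []
            for word in words:
                pieces = self.tokenizer.encode(word, add_special_tokens=False)
                word_spans.append((len(subword_ids), len(subword_ids) + len(pieces)))
                subword_ids.extend(pieces)
            subword_ids.append(self.tokenizer.sep_token_id)

            input_ids = torch.tensor([subword_ids])
            with torch.no_grad():
                hidden = self.bert(input_ids).hidden_states[self.layer][0]  # (n_subwords, dim)

            # Average the subword representations covering each word token.
            token_reprs.append(torch.stack(
                [hidden[start:end].mean(dim=0) for start, end in word_spans]))
        return token_reprs
```

Since the word-to-subword alignment for a given sentence never changes, one way to address the per-batch inefficiency would be to compute the subword spans once per sentence and cache them across epochs.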