
This repository contains the code for applying One-Token Approximation (OTA) to pretrained language models that use subword-level tokenization.
Paper: https://arxiv.org/abs/1904.06707

One-Token Approximation

While traditional word embedding algorithms (e.g., Word2Vec, GloVe) assign a single embedding to each word, pretrained language models (e.g., BERT, RoBERTa, XLNet, T5) typically represent words as sequences of subword tokens. For example, BERT represents the word strawberries as two tokens, straw and ##berries. Transferring ideas and algorithms from traditional embeddings to contextualized embeddings may therefore raise questions like the following:

What would the embedding of "strawberries" (or any other multi-token word) look like in BERT's embedding space if the word were represented by a single token?

One-Token Approximation (OTA) can be used to answer this question. More details can be found in the paper (https://arxiv.org/abs/1904.06707).
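
To see this subword splitting directly, the following minimal sketch prints BERT's tokenization of strawberries. It assumes the Hugging Face transformers library, which is not necessarily the exact version pinned in environment.yml:

from transformers import BertTokenizer

# Load the vocabulary of a standard pretrained BERT model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# "strawberries" is not in the vocabulary as a single token,
# so it is split into two subword tokens.
print(tokenizer.tokenize("strawberries"))  # ['straw', '##berries']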

Dependencies

All dependencies can be found in environment.yml. If you use conda, simply type

conda env create -f environment.yml

to create a new environment with all required packages installed.
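
Afterwards, activate the new environment before running any scripts. The environment's name is defined in environment.yml; the name ota below is an assumption, not a confirmed value:

conda activate ota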

Usage

To obtain One-Token Approximations for multi-token words, run the following command:

python3 ota.py --words WORDS --output_file OUTPUT_FILE --model_cls MODEL_CLS --model MODEL --iterations ITERATIONS

where

- WORDS is the path to a file containing the words for which one-token approximations should be computed,
- OUTPUT_FILE is the path to the file in which the resulting embeddings are stored,
- MODEL_CLS is the class of pretrained model to use (for example, bert),
- MODEL is the name of the pretrained model (for example, bert-base-uncased),
- ITERATIONS is the number of optimization iterations performed for each word.

For additional parameters, check the source code of ota.py or run python3 ota.py --help.
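
As a concrete example, a hypothetical invocation might look as follows; the file names and the iteration count are illustrative placeholders, not documented defaults:

python3 ota.py --words words.txt --output_file vectors.txt --model_cls bert --model bert-base-uncased --iterations 4000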

Citation

If you make use of One-Token Approximation, please cite the following paper:

@inproceedings{schick2020rare,
  title={Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking},
  author={Schick, Timo and Sch{\"u}tze, Hinrich},
  url={https://arxiv.org/abs/1904.06707},
  booktitle={Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year={2020}
}