timoschick / one-token-approximation

This repository contains the code for applying One-Token Approximation to a pretrained language model using subword-level tokenization.
https://arxiv.org/abs/1904.06707

Vocabulary Distribution #1

Open rajicon opened 3 years ago

rajicon commented 3 years ago

Hi,

I have been running this script on the WWC vocabulary (minimum frequency of 100), and it takes very long. So I was wondering whether there is a difference between what was run here and my build, and whether I am doing something incorrectly?

121,663 words are split into more than one token (so OTA needs to be applied to them). A lot of this is due to things like punctuation, so was that filtered out? Or maybe this is a tokenization issue?

23,260 words are a single token.
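
For reference, a breakdown like this can be reproduced with a short script along the following lines. This is only a sketch: it assumes a plain-text vocabulary with one word per line, uses BERT's WordPiece tokenizer from the `transformers` library, and the file path is hypothetical.

```python
"""Count how many vocabulary words are split into one vs. several subword tokens."""
from transformers import BertTokenizer

VOCAB_PATH = "wwc_vocab.txt"  # hypothetical path: one word per line

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

single_token, multi_token = 0, 0
with open(VOCAB_PATH, encoding="utf-8") as f:
    for line in f:
        word = line.strip()
        if not word:
            continue
        # Words that the tokenizer splits into more than one WordPiece would need OTA.
        if len(tokenizer.tokenize(word)) > 1:
            multi_token += 1
        else:
            single_token += 1

print(f"single-token words: {single_token}")
print(f"multi-token words (OTA needed): {multi_token}")
```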

How does this breakdown compare to the results in "Rare Words: A Major Problem for Contextualized Embeddings and How to Fix it by Attentive Mimicking"? Is there something I should be doing differently?

timoschick commented 3 years ago

Hi @rajicon,

indeed, we did perform word-level tokenization when generating the WWC-based vocabulary (I think we used this script here; if that is important for you, I can look up the details). However, we ended up with a similar order of magnitude for the number of multi-token words (>100,000). Unfortunately, OTA is indeed very slow, and the script in this repository provides no option to parallelize it across multiple GPUs.

However, if you have n > 1 GPUs available, what you can do to speed it up (and what we did for the paper) is to split the vocabulary into n evenly sized parts and run the ota.py script n times in parallel, one part per GPU; see the sketch below.
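
A minimal sketch of that splitting step, assuming a plain-text vocabulary with one word per line. The shard naming scheme and the ota.py arguments printed at the end are only illustrative (check the script's `--help` for the actual flags):

```python
"""Split a vocabulary file into n evenly sized shards so that ota.py can be
run on each shard in parallel (one process per GPU)."""
import argparse


def split_vocab(vocab_path: str, num_shards: int) -> list:
    """Write `num_shards` shard files next to `vocab_path` and return their paths."""
    with open(vocab_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]

    shard_paths = []
    for i in range(num_shards):
        # Every num_shards-th word goes into shard i, so shards are (almost) evenly sized.
        shard_words = words[i::num_shards]
        shard_path = f"{vocab_path}.shard{i}"
        with open(shard_path, "w", encoding="utf-8") as f:
            f.write("\n".join(shard_words))
        shard_paths.append(shard_path)
    return shard_paths


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("vocab_path", help="plain-text vocabulary, one word per line")
    parser.add_argument("--num_shards", type=int, default=4, help="number of GPUs / parallel runs")
    args = parser.parse_args()

    for i, path in enumerate(split_vocab(args.vocab_path, args.num_shards)):
        # The ota.py arguments below are placeholders; see the script's --help
        # for the real flag names before running.
        print(f"CUDA_VISIBLE_DEVICES={i} python ota.py --words {path} ...")
```

With four GPUs, for example, this produces four shard files and prints one command per shard to launch in separate shells; the per-shard outputs can then be combined once all runs have finished.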

rajicon commented 3 years ago

Ok, I will do that. Thanks!