shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.

Add support for word-level tokenization #30

Open shaigue opened 1 year ago

shaigue commented 1 year ago

Looking at the masking vocabularies published with the original work, it seems that they used whole words in their masking: all the n-grams consist of actual words, not subword tokens (every entry is a full word).

We need to figure this out, if we even want to do that. We might be able to use a different tokenizer, like the one from https://huggingface.co/bert-large-uncased-whole-word-masking, or use a custom whole-word tokenizer.

I'm thinking of adding support for a word-level tokenizer using the WordLevel model from https://huggingface.co/docs/tokenizers/components#models:~:text=WordLevel,tokens%20to%20IDs. and creating a custom tokenizer. I might need to train it from scratch.
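
For reference, a minimal sketch of what training such a tokenizer from scratch could look like with the Hugging Face tokenizers library (the corpus file name, vocabulary size, and special tokens below are placeholder assumptions, not final choices):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Word-level model: each whole word maps to a single ID, unknowns go to [UNK]
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train the vocabulary directly on the corpus (file name is a placeholder)
trainer = trainers.WordLevelTrainer(
    vocab_size=100_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("word_level_tokenizer.json")
```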

Using the Hugging Face tokenizers library to implement whole-word tokenization like this should integrate smoothly with my code. Here are some useful links that I found:

Another option is the approach in https://stackoverflow.com/questions/76040575/does-huggingface-have-a-model-that-is-based-on-word-level-tokens, which points to Word2Vec models that have a larger vocabulary...

https://github.com/huggingface/tokenizers/issues/553

I could also try loading the vocabulary from spaCy into a WordLevel tokenizer; that might be useful.


I also found https://github.com/dwyl/english-words, which is a nice list of English words.
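
If I go the fixed-vocabulary route instead of training, a sketch like the following might work (assuming a plain word list, one word per line, such as words_alpha.txt from that repo; the file name and the [UNK] handling are my assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Build a vocabulary from a plain word list, one word per line
# (words_alpha.txt from dwyl/english-words is assumed here)
with open("words_alpha.txt", encoding="utf-8") as f:
    vocab = {word: i for i, word in enumerate(line.strip() for line in f if line.strip())}
vocab["[UNK]"] = len(vocab)

tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```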

But for the sake of simplicity, I think I'm just going to train the tokenizer on the dataset and use that.

02.07.2023: Support was added by training the tokenizer on the dataset. Closed.