shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.

Add support for word-level tokenization #30

Open shaigue opened 1 year ago

shaigue commented 1 year ago

Looking at the masking vocabularies published with the original work, it seems that they used whole words in their masking: all the n-grams consist of actual words, not subword tokens (every entry is a full word).

We need to figure this out, if we even want to do that. We might be able to use a different tokenizer, like the one from https://huggingface.co/bert-large-uncased-whole-word-masking, or use a custom whole-word tokenizer.

I'm thinking of adding support for a word-level tokenizer using the WordLevel model from https://huggingface.co/docs/tokenizers/components#models:~:text=WordLevel,tokens%20to%20IDs. and creating a custom tokenizer. I might need to train it from scratch.
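
For reference, a minimal sketch of what training such a tokenizer from scratch could look like with the Hugging Face tokenizers library (the corpus file name, vocabulary size, and special tokens below are placeholder assumptions, not final choices):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Word-level model: each whole word maps to a single ID, unknowns go to [UNK]
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train the vocabulary directly on the corpus (file name is a placeholder)
trainer = trainers.WordLevelTrainer(
    vocab_size=100_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("word_level_tokenizer.json")
```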

Using the Hugging Face tokenizers library to implement whole-word tokenization like this should integrate smoothly with my code. Here are some useful links that I found:

Another option is the approach in https://stackoverflow.com/questions/76040575/does-huggingface-have-a-model-that-is-based-on-word-level-tokens, which points to Word2Vec models that have a larger vocabulary...

https://github.com/huggingface/tokenizers/issues/553

I could also try loading the vocabulary from spaCy into a WordLevel tokenizer; that might be useful.


I also found https://github.com/dwyl/english-words, which is a nice list of English words.
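
If I go the fixed-vocabulary route instead of training, a sketch like the following might work (assuming a plain word list, one word per line, such as words_alpha.txt from that repo; the file name and the [UNK] handling are my assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

# Build a vocabulary from a plain word list, one word per line
# (words_alpha.txt from dwyl/english-words is assumed here)
with open("words_alpha.txt", encoding="utf-8") as f:
    vocab = {word: i for i, word in enumerate(line.strip() for line in f if line.strip())}
vocab["[UNK]"] = len(vocab)

tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
```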

But for the sake of simplicity, I think I'm just going to train the tokenizer on the dataset and use that.

02.07.2023: Support was added by training the tokenizer on the dataset. Closed.