marcotcr / checklist

Beyond Accuracy: Behavioral Testing of NLP models with CheckList
MIT License

Roberta-specific tokenization? #24

Closed mimno closed 4 years ago

mimno commented 4 years ago

https://github.com/marcotcr/checklist/blob/d26abda15cc433348dce7e559da6142b6fad6152/checklist/text_generation.py#L89

This line seems to assume RoBERTa-style tokenization, where a special character (Ġ, "G-dot") marks a token that occurs at the beginning of a word. It fails for BERT-style tokenization, which instead uses a special prefix (##) to mark tokens that are *not* at the beginning of a word. It would also fail if the tokenizer is uncased (John -> john). I can't really see a way to fix it without knowing something about the different model names, though.
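To illustrate why a single check can't cover both conventions, here is a minimal sketch (the helper name and the per-style dispatch are hypothetical, not part of CheckList): RoBERTa marks word-*initial* tokens with a Ġ prefix, while BERT's WordPiece marks word-*continuation* tokens with ##, so the two schemes need opposite tests.

```python
def is_word_start(token: str, style: str) -> bool:
    """Return True if a subword token begins a new word.

    Hypothetical helper showing the two incompatible conventions:
    - RoBERTa (byte-level BPE): word-initial tokens carry a 'Ġ' prefix.
    - BERT (WordPiece): continuation tokens carry a '##' prefix,
      so word-initial tokens are the *unmarked* ones.
    """
    if style == "roberta":
        return token.startswith("\u0120")   # '\u0120' is 'Ġ'
    if style == "bert":
        return not token.startswith("##")
    raise ValueError(f"unknown tokenizer style: {style}")

# "Johnson" might split as ['ĠJohn', 'son'] under RoBERTa
# but as ['John', '##son'] under BERT.
print(is_word_start("\u0120John", "roberta"))  # True
print(is_word_start("son", "roberta"))         # False (no Ġ prefix)
print(is_word_start("John", "bert"))           # True
print(is_word_start("##son", "bert"))          # False
```

A check written for one convention silently gives wrong answers on the other, which is why the fix would need to know which family of model is loaded.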

marcotcr commented 4 years ago

I spent some time trying to do something general-purpose, but each tokenizer seems to have its own convention. I will fix the casing though, good catch.

marcotcr commented 4 years ago

Changed John -> john in ae99359