This line seems to assume Roberta-style tokenization, where there is a special character (Ġ) marking a token that occurs at the beginning of a word, but it fails for BERT-style tokenization, which uses a special sequence (##) to mark tokens that are *not* at the beginning of a word. It would also fail if the tokenizer is uncased (John -> john). I can't really see a way to fix it without knowing something about different model names, though.
I spent some time trying to do something general-purpose, but each tokenizer seems to have its own standard. I will change the case though, good catch.
https://github.com/marcotcr/checklist/blob/d26abda15cc433348dce7e559da6142b6fad6152/checklist/text_generation.py#L89
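To make the difference concrete, here is a minimal sketch of the two word-boundary conventions, assuming the Hugging Face `transformers` `AutoTokenizer` API; the model names (`roberta-base`, `bert-base-uncased`) and the `starts_new_word` helper are illustrative, not what the linked CheckList code actually uses:

```python
# Sketch only: shows how RoBERTa-style and BERT-style tokenizers mark word
# boundaries differently, which is why a single string check cannot be general.
from transformers import AutoTokenizer

def starts_new_word(token: str) -> bool:
    """Heuristic guess at whether a subword token begins a new word.

    RoBERTa/GPT-2 BPE prefixes word-initial tokens with 'Ġ' (a space marker),
    while BERT WordPiece prefixes *continuation* tokens with '##'.
    """
    if token.startswith("Ġ"):   # RoBERTa/GPT-2 convention: marker on word starts
        return True
    if token.startswith("##"):  # BERT convention: marker on word continuations
        return False
    # No marker: e.g. a sentence-initial RoBERTa token or a BERT word-start token.
    return True

roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Illustrative outputs (exact pieces depend on the vocabulary):
print(roberta_tok.tokenize("John likes tokenization"))
# e.g. ['John', 'Ġlikes', 'Ġtoken', 'ization']
print(bert_tok.tokenize("John likes tokenization"))
# e.g. ['john', 'likes', 'token', '##ization']  -- note the lowercasing of 'John'
```

Note how the uncased BERT tokenizer also lowercases "John", which is the casing issue mentioned above.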