Closed vedantroy closed 1 week ago
Following up on this, this is a slightly patched version of the tokenizer in this repo (just made one of the methods an instance method instead of a module method):
```python
tokenizer = clip_tokenizer.SimpleTokenizer(
    bpe_path=str(vocab_path)
)
output = tokenizer.tokenize("hello world", context_length=77)
print(output.shape)
decoded = tokenizer.decode_tensor(output[0])
print(decoded)
```
And the decoded output is:
```
<start_of_text>hello world <end_of_text>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
I'm assuming the exclamation points at the end are fine?
Hi @vedantroy, for the different special tokens I don't know if there is a specific reason. As for the exclamation marks: the tokenizer uses 0 as the padding index, but 0 is also the index assigned to the exclamation mark, so every padded position decodes to `!`. Unfortunately I think changing this would be a lot of work, as all existing models were trained with this behavior.
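A minimal sketch of why the padding renders as exclamation marks. The `decoder` dict and special-token ids below are illustrative stand-ins, not the real vocab, but the key fact (byte-level token 0 in the CLIP BPE vocab is `!`) matches the actual tokenizer:

```python
# Toy illustration: pad id 0 collides with the token for "!".
context_length = 16  # the real CLIP context length is 77

# Illustrative fragment of an id -> string decoder table;
# id 0 -> "!" mirrors the real vocab, the other ids are made up here.
decoder = {0: "!", 1: "<start_of_text>", 2: "hello", 3: "world", 4: "<end_of_text>"}

tokens = [1, 2, 3, 4]  # encoded "hello world" with special tokens
# Zero-pad out to the fixed context length, as the tokenizer does.
padded = tokens + [0] * (context_length - len(tokens))

decoded = "".join(decoder[t] for t in padded)
print(decoded)
# the tail is a run of "!" because each pad slot decodes via id 0
```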
It's fine as long as the right special tokens are used; the outputs are equivalent. Side note: I think the transformers CLIP tokenizer had (and possibly still has) an issue where the zero padding wasn't extended out to the full context length like in the original OpenAI implementation.
Small discrepancy I noticed between the two tokenizers:
`<|startoftext|>`
`<start_of_text>`
Not a huge difference, but figured I'd make sure there's no reason for this.
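If it helps, the two spellings appear to be just different display strings for the same underlying token id, so encodings stay interchangeable. A sketch; the id 49406 for start-of-text matches the standard CLIP BPE vocab, but the two mapping dicts are illustrative:

```python
SOT_ID = 49406  # start-of-text id in the standard CLIP BPE vocab

# Each tokenizer renders the same special-token id with a different string.
openai_style = {SOT_ID: "<|startoftext|>"}  # original OpenAI CLIP spelling
repo_style = {SOT_ID: "<start_of_text>"}    # this repo's spelling

# Same id in, different text out; the token sequences themselves match.
print(openai_style[SOT_ID], repo_style[SOT_ID])
```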