mlfoundations / open_clip

An open source implementation of CLIP.

Improve tokenizer decode #403

Open vturrisi opened 1 year ago

vturrisi commented 1 year ago

Right now the tokenizer's decode method only supports a single sequence at a time. I think it would be good to have a batch_decode function, and to support skip_special_tokens and clean_up_tokenization_spaces as the Hugging Face tokenizers do.

gpucce commented 1 year ago

@vturrisi I'll get to this as soon as I can. What is the skip_special_tokens arg meant to do?

vturrisi commented 1 year ago

No worries @gpucce. It basically removes the start-of-text (sos) and end-of-text (eos) tokens, plus any padding, from the decoded string. See https://huggingface.co/docs/transformers/main_classes/tokenizer
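
For illustration, here is a rough sketch of what the requested behaviour could look like. It is not open_clip's implementation: `batch_decode`, `strip_special_tokens`, and all of their parameters are hypothetical names, and the single-sequence decoder is passed in as `decode_fn` so the snippet does not depend on any particular tokenizer class. It assumes padding only ever appears after the end-of-text token, so cutting at the first end-of-text id also drops the padding. The whitespace cleanup is a simplified stand-in for the Hugging Face option, not a reimplementation of it.

```python
from typing import Callable, Iterable, List


def strip_special_tokens(ids: List[int], sot_id: int, eot_id: int) -> List[int]:
    """Drop the start token and everything from the first end token onward.

    Assumption: padding only follows the end token, so it is removed as well.
    """
    kept = []
    for token_id in ids:
        if token_id == eot_id:
            break  # the rest is end-of-text plus padding
        if token_id != sot_id:
            kept.append(token_id)
    return kept


def batch_decode(
    decode_fn: Callable[[List[int]], str],  # existing single-sequence decoder
    batch: Iterable[Iterable[int]],         # e.g. a list of lists or a 2-D tensor
    sot_id: int,                            # start-of-text id (tokenizer specific)
    eot_id: int,                            # end-of-text id (tokenizer specific)
    skip_special_tokens: bool = False,
    clean_up_tokenization_spaces: bool = False,
) -> List[str]:
    """Hypothetical helper: decode every sequence in a batch."""
    texts = []
    for ids in batch:
        ids = [int(i) for i in ids]  # also accepts tensor rows
        if skip_special_tokens:
            ids = strip_special_tokens(ids, sot_id, eot_id)
        text = decode_fn(ids)
        if clean_up_tokenization_spaces:
            # Simplified: collapse runs of whitespace and trim the ends.
            text = " ".join(text.split())
        texts.append(text)
    return texts


# Toy vocabulary and decoder, made up purely to exercise the helper.
toy_vocab = {0: "!", 1: "a", 2: "photo", 4: "cat",
             98: "<start_of_text>", 99: "<end_of_text>"}


def toy_decode(ids: List[int]) -> str:
    return " ".join(toy_vocab[i] for i in ids)


toy_batch = [[98, 1, 2, 99, 0, 0], [98, 4, 99, 0, 0, 0]]
decoded = batch_decode(toy_decode, toy_batch, sot_id=98, eot_id=99,
                       skip_special_tokens=True)
print(decoded)  # ['a photo', 'cat']
```

With a real tokenizer, `decode_fn` would be its existing decode method and `sot_id` / `eot_id` would be looked up from its vocabulary rather than hard-coded.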