lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0
36.63k stars 4.52k forks

Extraneous newlines in lmsys/fastchat-t5-3b-v1.0 tokenizer #1022

Closed bradfox2 closed 1 year ago

bradfox2 commented 1 year ago

The Vicuna tokenizer round-trips text with no extra '\n' characters. The FastChat-T5 tokenizer inserts one after every word.

Reproduce:

```python
from transformers import T5TokenizerFast, LlamaTokenizer

t = T5TokenizerFast.from_pretrained('lmsys/fastchat-t5-3b-v1.0')
text = 'I am a dog and i dont like cats'
t(text)
t.decode(t(text)['input_ids'])

t2 = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-delta-v0')
t2.decode(t2.encode(text))
```

From T5: `'I\n am\n a\n dog\n and\n i\n dont\n like\n cats'`
From Vicuna-LLaMA: `'I am a dog and i dont like cats'`

This seems like a bug. Is there a purpose for this?

merrymercy commented 1 year ago

@DachengLi1 can explain this better. I guess you can use this argument in decode: https://github.com/lm-sys/FastChat/blob/ea6c7b6da47d15d6e3264d0abba7b8d1090479a4/fastchat/serve/huggingface_api.py#L46

bradfox2 commented 1 year ago

Thanks for the response. I'm more concerned about training on a bunch of extra newlines when using the provided tokenizer.

Removing the intermediate newlines from the output, or simply using the Flan series of tokenizers, works fine for decoding/inference.
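As a plain-Python sketch of that workaround (the input string is the decoded output quoted earlier in the issue; the cleanup rule itself is an assumption, not FastChat code):

```python
# Strip the intermediate "\n" characters that the fast-tokenizer
# round-trip inserts between words (workaround sketch, not FastChat code).
decoded = 'I\n am\n a\n dog\n and\n i\n dont\n like\n cats'
cleaned = decoded.replace('\n ', ' ')
print(cleaned)  # I am a dog and i dont like cats
```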

DachengLi1 commented 1 year ago

Good point on the T5Tokenizer!

Firstly, we use T5Tokenizer instead of T5TokenizerFast, and the two behave differently (there is an HF discussion thread on this). If we encode the sentence with T5Tokenizer, we find:

[Screenshot 2023-05-09 at 11:19 AM: T5Tokenizer encoding of the sentence]

where token 32106 is actually whitespace, not a newline.

[Screenshot 2023-05-09 at 11:23 AM: token 32106 decoding to whitespace]

Lastly, we use T5Tokenizer to support decoding with special tokens. And @merrymercy is totally right: we have to add spaces_between_special_tokens=False when decoding. In particular, we treat whitespace as a special token because the SentencePiece model for T5 collapses consecutive whitespace into a single one. That is not ideal if we want to output indentation-sensitive text (e.g. code). There is an issue on this.
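The indentation concern can be illustrated without the tokenizer at all; the snippet below is a plain-Python mimic of whitespace collapsing, not SentencePiece itself:

```python
# Mimic of whitespace collapsing: runs of whitespace become a single
# separator, which destroys indentation in code-like text.
code = "def f():\n    return 1"
collapsed = " ".join(code.split())   # split() eats all whitespace runs
print(collapsed)  # def f(): return 1  <- indentation lost
```

This is why FastChat-T5 promotes whitespace to a special token: it survives the round trip instead of being merged away.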

@bradfox2 Let me know if there is any further question!

bradfox2 commented 1 year ago

@DachengLi1 Thank you for the answer. I was not aware of that difference in standard vs Fast. Makes sense now.