@DachengLi1 can explain this better. I guess you can pass this argument to `decode`: https://github.com/lm-sys/FastChat/blob/ea6c7b6da47d15d6e3264d0abba7b8d1090479a4/fastchat/serve/huggingface_api.py#L46
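For context, the argument in question is `spaces_between_special_tokens`. A minimal sketch of passing it (the checkpoint name is an assumption, and this is not the exact code at the linked line):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; substitute the tokenizer you are using.
tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0", use_fast=False)

output_ids = tokenizer.encode("hello world")
# Suppress the extra space decode() normally inserts between special tokens.
text = tokenizer.decode(output_ids, spaces_between_special_tokens=False)
```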
Thanks for the response. I'm more concerned about training a bunch of extra newlines into the model when using the provided tokenizer.
Removing intermediate newlines from the output, or simply using the Flan series of tokenizers, works fine for decoding/inference.
Good point on the T5Tokenizer!
First, we use `T5Tokenizer` instead of `T5TokenizerFast`, and there is a difference between the two (there is an HF discussion thread on this). If we encode the sentence with `T5Tokenizer`, we will find that token 32106 is actually whitespace, not a newline.
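A quick way to check this yourself; a minimal sketch, assuming the `lmsys/fastchat-t5-3b-v1.0` checkpoint and that both tokenizer classes can load it:

```python
from transformers import T5Tokenizer, T5TokenizerFast

# Checkpoint name is an assumption; substitute the tokenizer you are training with.
slow = T5Tokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")
fast = T5TokenizerFast.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

text = "I am a dog"
slow_ids = slow.encode(text)
fast_ids = fast.encode(text)

# The two encodings differ; map each id back to its token string
# to see what ids like 32106 actually represent.
print(slow_ids)
print(fast_ids)
print(slow.convert_ids_to_tokens(slow_ids))
```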
Lastly, we use `T5Tokenizer` to support decoding with special tokens. And @merrymercy is totally right: we have to add `spaces_between_special_tokens=False` to do this decoding. In particular, we treat whitespace as a special token because the SentencePiece model for T5 collapses consecutive whitespace into a single token. That is not ideal when we want to output indentation-sensitive text (e.g., code). There is an issue on this.
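A minimal sketch of the decoding difference, assuming the same checkpoint as above and that the whitespace token is registered as a special token:

```python
from transformers import T5Tokenizer

# Checkpoint name is an assumption.
tokenizer = T5Tokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

# Indentation-sensitive input: the four leading spaces must survive a round trip.
ids = tokenizer.encode("    return x")

# Default behavior inserts a space between consecutive special tokens,
# which mangles runs of whitespace:
print(repr(tokenizer.decode(ids, skip_special_tokens=False)))

# spaces_between_special_tokens=False preserves the original spacing:
print(repr(tokenizer.decode(ids, skip_special_tokens=False,
                            spaces_between_special_tokens=False)))
```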
@bradfox2 Let me know if there are any further questions!
@DachengLi1 Thank you for the answer. I was not aware of that difference between the standard and Fast tokenizers. Makes sense now.
The Vicuna tokenizer has no extra '\n' characters; the T5 tokenizer inserts them after each space.
Reproduce:
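A minimal reproduction sketch; the checkpoint names are assumptions:

```python
from transformers import AutoTokenizer

text = "I am a dog and i dont like cats"

# Checkpoint names are assumptions; use the tokenizers you are comparing.
t5_tok = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0", use_fast=False)
vicuna_tok = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3", use_fast=False)

# Round-trip the same sentence through both tokenizers and compare.
print(repr(t5_tok.decode(t5_tok.encode(text), skip_special_tokens=True)))
print(repr(vicuna_tok.decode(vicuna_tok.encode(text), skip_special_tokens=True)))
```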
From T5: `'I\n am\n a\n dog\n and\n i\n dont\n like\n cats'`
From Vicuna-LLaMA: `'I am a dog and i dont like cats'`
This seems like a bug. Is there a purpose for this?