lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Why are the results worse when using HuggingFace model implementations vs. local? #2089


MarinaWyss commented 1 year ago

Hi, I have a very basic question:

I would like to download Vicuna from HuggingFace (e.g. this model) and use it to ask arbitrary questions, like "What topics are discussed in this text?" or "Summarize what happened in this text." If I try this in the GUI, I get reasonable answers to all of my questions. I have also tried downloading the weights via FastChat, and that works as expected too.

But when I try the same thing with the HuggingFace version, the quality is much worse. Rather than responding to the question as I would expect, the model typically just echoes the input text followed by some additional generated text, which is not what I'm after.

Here is some example code:

from transformers import LlamaTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights straight from the HuggingFace Hub
tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3", legacy=False)
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.3")

input_text = """
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: In one sentence, describe what happened in this video transcript: Hey everyone, hope you're having a nice day. I'm looking forward to playing piano later, but how are you all? Ha ha, that's nice. Ok, let's get started. This is one of my favorite songs. Oh yeah that sounds fun.

ASSISTANT:
"""

# Tokenize the prompt and generate; note this decodes greedily,
# since no sampling parameters are passed
input_ids = tokenizer(input_text, return_tensors="pt")
out = model.generate(input_ids['input_ids'], max_new_tokens=100)
result = tokenizer.decode(out[0])

print(result)
"<s> A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: In one sentence, describe what happened in this video transcript: Hey everyone, hope you're having a nice day. I'm looking forward to playing piano later, but how are you all? Ha ha, that's nice. Ok, let's get started. This is one of my favorite songs. Oh yeah that sounds fun. I'm gonna play it now. Wow, that was really good. I'm so happy. I love playing music. I'm so glad I have this channel. I'm so grateful for all of you. I'm so happy to be alive. I love you all.</s>"

If anyone can point me in the right direction/offer advice I would be super grateful. I feel like I must be missing something obvious. Thank you!

surak commented 1 year ago

What do you mean by "using HuggingFace versions"? Do you mean having the inference done on their side? I use these models by cloning their repo, and they work just fine, which seems to be what you are doing too.

MarinaWyss commented 1 year ago

If I download FastChat and the model weights directly and use the CLI to interact with the model (the same one here), the results are vastly superior.
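
Concretely, the command I'm running is along these lines (the standard FastChat CLI entry point):

python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3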

Basically, when I use an EC2 instance with the exact same specs to run the downloaded weights via the CLI, I get good results, but when I use that same instance in a SageMaker notebook with this code, the results are not nearly as good.

I'm just curious about why that would be.
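
One concrete difference I've spotted so far: model.generate with no sampling arguments decodes greedily, while I believe the CLI samples with a nonzero temperature by default. Continuing from my snippet above, here is a sketch that brings the two closer together (the exact default values are my assumption; see fastchat.serve.cli for the real ones):

# Sample instead of decoding greedily; the temperature value here is an
# assumption meant to approximate the FastChat CLI default
out = model.generate(
    input_ids["input_ids"],
    do_sample=True,
    temperature=0.7,
    max_new_tokens=100,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))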