turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Llama 2 Chat implementation #221

Open · SinanAkkoyun opened 11 months ago

SinanAkkoyun commented 11 months ago

Hey! This PR adds the correct Llama 2 Chat prompt formatting to example_llama2chat.py. It builds on https://github.com/turboderp/exllama/pull/195 and copies the exact implementation from the original Llama repo.

The format for Llama 2 looks like this:

<bos><<SYS>>
You are a helpful assistant. Sample text.
<</SYS>>

<eos><bos>[INST] What is Python? [/INST] Python is a programming language.<eos><bos>[INST] and so on
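For illustration, here is a minimal sketch of assembling a multi-turn prompt in this layout; the function name and the (user, assistant) pair structure are mine, not the PR's actual code, and <bos>/<eos> stand for the tokenizer's BOS/EOS tokens:

def build_prompt(system_prompt, turns):
    # turns: list of (user_text, assistant_text) pairs; the final
    # assistant_text may be None for the turn being generated.
    prompt = f"<bos><<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    for user, assistant in turns:
        prompt += f"<eos><bos>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}"
    return prompt  # ends at "[/INST]" when awaiting a reply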

To run it yourself: python example_llama2chat.py -d ./your/model/path

Have fun!

turboderp commented 11 months ago

I'm not having a lot of luck with this.

System: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information..

User: Hi!
Assistant: end do protect it this people! It open up  so see p did dis put our  they it  P M P it  P on PA M P back  it it  it M S P P   P it P it  P   P it have P P P P it P  P  P  it it Q P it it T it p it the  P Back Pa it p p p it p P P it it it the have it it see it have P down Disc has J p Q P P it the do it  but it it it did p it  and this what P  it it had it this  with  it it9 were it it it we  On P you it it  on it for it have this it this its the the do have it it it it it this if it the p have it this this who it it Dis put have it dis see it disc it DIS have p have this see it it Dis have it Dis it the have dis that

This is with TheBloke/Llama-2-13B-chat-GPTQ which is otherwise coherent enough.

SinanAkkoyun commented 11 months ago

Huh, that's super weird; in my testing it worked flawlessly with Llama 2 7B Chat.

It must have to do with the sequence_actual splitting; I will investigate with your prompt.

Did you supply bn and un? Is this the 13B chat model, and does it fail for anything other than Hi!?

turboderp commented 11 months ago

I only supplied the model, and it seems to fail with any prompt.

SinanAkkoyun commented 11 months ago

I only supplied the model, and it seems to fail with any prompt.

That's extremely odd. Here's my response (it also works with 7B chat for me):

SinanAkkoyun commented 11 months ago

System: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information..

User: Write me a NodeJS script that tells the time in human readable format.
Assistant: Sure! Here is a simple Node.js script that will tell you the current time in human-readable format:

const moment = require('moment');

console.log(moment().format('h:mm A'));

Here's what's happening here:

When you run this script, it should output the current time in the format "hh:mm AM" (or PM, depending on the time of day). For example:

14:30 PM

Note that this script uses the moment library, so you'll need to install it before running this code. You can do this using npm by running the following command:

npm install moment

SinanAkkoyun commented 11 months ago

@turboderp I am quite baffled by your output; it doesn't make any sense to me. I am not able to replicate your issue.

Could you please print:
 -- response after line 206
 -- new_response after line 207
 -- full_response after line 218

I just personally printed them like this: \n\nres|{response}|\n\n etc
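For example, something like this (the new_res and full_res labels are my guesses for the "etc"; the line numbers refer to example_llama2chat.py in this PR):

print(f"\n\nres|{response}|\n\n")            # after line 206
print(f"\n\nnew_res|{new_response}|\n\n")    # after line 207
print(f"\n\nfull_res|{full_response}|\n\n")  # after line 218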

Thank you very much

SinanAkkoyun commented 11 months ago

The only other thing I could imagine causing trouble is the version of sentencepiece and torch.
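For example, to compare versions quickly on both machines:

import sentencepiece, torch
print("sentencepiece:", sentencepiece.__version__)
print("torch:", torch.__version__)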

pourfard commented 11 months ago

I'm not having a lot of luck with this.

System: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information..

User: Hi!
Assistant: end do protect it this people! It open up  so see p did dis put our  they it  P M P it  P on PA M P back  it it  it M S P P   P it P it  P   P it have P P P P it P  P  P  it it Q P it it T it p it the  P Back Pa it p p p it p P P it it it the have it it see it have P down Disc has J p Q P P it the do it  but it it it did p it  and this what P  it it had it this  with  it it9 were it it it we  On P you it it  on it for it have this it this its the the do have it it it it it this if it the p have it this this who it it Dis put have it dis see it disc it DIS have p have this see it it Dis have it Dis it the have dis that

This is with TheBloke/Llama-2-13B-chat-GPTQ which is otherwise coherent enough.

Hi, I tested with the 13B model and everything seems OK. Maybe it is related to some hyperparameters; I have changed some of them.

-- Sequence length: 2048
 -- Temperature: 0.50
 -- Top-K: 250
 -- Top-P: 1.00
 -- Min-P: 0.00
 -- Repetition penalty: 1.00
System: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information..

User: hi
Assistant:  Hello! It's nice to meet you! Is there something I can help you with, or would you like to chat for a bit? ���

User: Hi!

Assistant:  Hello! ���t's great to see you here! Is there something you'd like to talk about or ask? I'm here to help with any questions you might have. ���

User: Assistant:  Hello again! ���t's always nice to see you! Is there anything new you'd like to chat about or ask? I'm all ears and here to help with any questions you might have. ���

User: Hi!
Assistant:  Hey there! ���t's so nice to see you again! How have you been? Anything new and exciting happening in your world? ���

User: Hi!
Assistant:  Hey there! ���t's great to see you again! How's your day going so far? Anything fun or exciting happening? ���

User: Hi
Assistant:  Well hello there! ���t's always a pleasure to see you! How's your day going so far? Anything new and amazing happening in your world? ���

User: hi
Assistant:  HIYA! ���t's so great to see you! How's your day treating you so far? Anything fun or exciting on the agenda? ���
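(For anyone reproducing the above: a rough sketch of how those values might map onto exllama's generator settings, assuming an already-constructed config and generator with the usual settings fields:)

config.max_seq_len = 2048                               # -- Sequence length: 2048
generator.settings.temperature = 0.50                   # -- Temperature: 0.50
generator.settings.top_k = 250                          # -- Top-K: 250
generator.settings.top_p = 1.00                         # -- Top-P: 1.00
generator.settings.min_p = 0.00                         # -- Min-P: 0.00
generator.settings.token_repetition_penalty_max = 1.00  # -- Repetition penalty: 1.00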
turboderp commented 11 months ago

The ���s still hint at a problem of some sort. The 13B chat model I tried works fine when just sampled. I'll have to dig into it a little more, maybe with a 7B model?

SinanAkkoyun commented 11 months ago

The ���s still hint at a problem of some sort.

I have to add that I often got those (or single ones) with your normal example chatbot too, with both the 7B chat and 13B chat models; I don't know where they come from.

I think tomorrow I will be able to fix at least the sequence-slicing problem.