meta-llama / llama

Inference code for Llama models

Tried 7B and 13B models, can't get any decent result from inference #69

Closed allaccs closed 1 year ago

allaccs commented 1 year ago

Well, both models seem to be unable to follow any instruction, answer any question, or even continue text. Do we need to fine-tune them or add more functions in order to get decent results?

A few examples; everything is like this:

"What is the meaning of life? Be concise and do not repeat yourself." In my country, they have banned smoking. There was only one solution: I became a tourist.

  • It is not enough to dream of peace. We must work for it.
  • In a war, the first victims are truth and compassion.
  • Do not fight for your rights; fight for your duties.
  • Strange that the best university graduates are in search of jobs.
  • The strongest is not the one who can speak well.
  • A complaint is a cry for help.
  • Time and tide wait for no man, which is unfortunate because time waits for everyone.
  • There are no sacred cows. The steaks are too high.
  • The last thing one expects is to be eaten, but it’s an occupational hazard.
  • Life is a process of evolution, not revolution.
  • We are all failed books of poetry.

How weird is that?

"Explain the theory of relativity." Because the two spaces have different lengths, time can vary on each spaceship. But I just don't understand this whole theory that's so hard for me. Time varies due to the difference in length between the spaceships. The spaceships themselves move at the same speed, so it's only the length of the spaceship that makes time seem to pass faster. Time will continue to "pass" faster for the astronaut on the ship that is moving at a faster speed, but the astronaut aboard the faster ship will be able to observe the ship moving at a slower speed.

Ten easy steps to build a website...

"10 steps to build a website : 1." 10 steps to build a website :

  1. 10 steps to build a website :
  2. 10 steps to build a website :
  3. 10 steps to build a website :
  4. 10 steps to build a website :
  5. 10 steps to build a website :
  6. 10 steps to build a website :
  7. 10 steps to build a website :
  8. 10 steps to build a website :
  9. 10 steps to build a website :
zoanthal commented 1 year ago

Yes, I'm getting some pretty erratic responses on the 7B. But considering it is so much smaller than GPT-3 or the 65B, that's to be expected.

elephantpanda commented 1 year ago

Don't forget, it hasn't been trained to be a chatbot. All it knows how to do is predict the next word in the sequence. ChatGPT also has a lot of hidden prompts that you don't see, with examples of how it should behave. So to answer questions, try first giving it examples of questions and answers, e.g.

QUESTION: What colour is the sky?
ANSWER: It is commonly thought to be blue.
QUESTION: Who first landed on the moon?
ANSWER: A good question, this is known to be Neil Armstrong.
QUESTION: What is the capital of France?
ANSWER:

Maybe that will work better. IDK, I haven't tried it myself because I haven't been accepted 😭😭😭
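In code, that kind of prompt might look something like this (a rough, untested sketch; the `generator.generate()` call just follows what example.py in this repo does, and the sampling values are guesses):

```python
# Rough, untested sketch: build a few-shot Q&A prompt as one plain string and
# let the model continue it. `generator` would be the LLaMA object returned by
# load() in example.py; sampling values below are guesses.
few_shot_prompt = (
    "QUESTION: What colour is the sky?\n"
    "ANSWER: It is commonly thought to be blue.\n"
    "QUESTION: Who first landed on the moon?\n"
    "ANSWER: A good question, this is known to be Neil Armstrong.\n"
    "QUESTION: What is the capital of France?\n"
    "ANSWER:"
)

results = generator.generate([few_shot_prompt], max_gen_len=64, temperature=0.8, top_p=0.95)
print(results[0])
```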

zoanthal commented 1 year ago

> Don't forget, it hasn't been trained to be a chatbot. All it knows how to do is predict the next word in the sequence. ChatGPT also has a lot of hidden prompts that you don't see, with examples of how it should behave. So to answer questions, try first giving it examples of questions and answers, e.g.
>
> QUESTION: What colour is the sky?
> ANSWER: It is commonly thought to be blue.
> QUESTION: Who first landed on the moon?
> ANSWER: A good question, this is known to be Neil Armstrong.
> QUESTION: What is the capital of France?
> ANSWER:
>
> Maybe that will work better. IDK, I haven't tried it myself because I haven't been accepted 😭😭😭

Human> QUESTION: What colour is the sky? ANSWER: It is commonly thought to be blue. QUESTION: Who first landed on the moon? ANSWER: A good question, this is known to be Neil Armstrong. QUESTION: What is the captial of France? LLaMa> QUESTION: What colour is the sky? ANSWER: It is commonly thought to be blue. QUESTION: Who first landed on the moon? ANSWER: A good question, this is known to be Neil Armstrong. QUESTION: What is the captial of France? ANSWER: Paris. QUESTION: Which continent is the largest? ANSWER: The African. QUESTION: How many states are in Australia? ANSWER: 6. QUESTION: What is the national sport of the USA? ANSWER: Football. QUESTION: What is the national flower of England? ANSWER: Rose. QUESTION: How many coloured dots are on the American flag? ANSWER: 50. QUESTION: In the UK who is the head of state? ANSWER: The Queen. QUESTION: What is the national sport of Japan? ANSWER: Sumo. QUESTION: Which is the smallest continent? ANSWER: Australia. QUESTION: What is the national animal of the UK? ANSWER: The lion. QUESTION: What is the capital of France? ANSWER: Paris. QUESTION: What is the capital of Australia? ANSWER: Canberra. QUESTION: What is the capital of Mexico? ANSWER: Mexico City. QUESTION: In the UK what is the name of the largest island? ANSWER: Great Britain. QUESTION: In the UK what is the name of the island which the Scottish mainland is located on? ANSWER: Great Britain. QUESTION: Who is the Prime Minister of England? ANSWER: Tony Blair. QUESTION: Who is the President of the USA? ANSWER: George W. Bush. QUESTION: How many states are there in the UK? ANSWER: 52. QUESTION: What is the national flower of England? ANSWER: Rose. QUESTION: How many red stripes are on the American flag? ANSWER: 13. QUESTION: What is the national bird of the UK? ANSWER: The Robin. QUESTION: What is the national sport of France? ANSWER: Rugby. QUESTION: What is the national animal of the USA? ANSWER: The bald eagle. QUESTION: What is the national symbol of France? ANSWER:

allaccs commented 1 year ago

QUESTION: What colour is the sky? ANSWER: It is commonly thought to be blue. QUESTION: Who first landed on the moon? ANSWER: A good question, this is known to be Neil Armstrong. QUESTION: What is the captial of France? ANSWER: It is Paris. QUESTION: What is the largest island in the world? ANSWER: This is Russia, it is the largest. QUESTION: How many gallons of water are in a barrel? ANSWER: This is not an easy question. The best guess is 20 gallons. QUESTION: What is the largest lake in the world? ANSWER: This is the Caspian Sea, it is the largest. QUESTION: How many continents are there? ANSWER: It is commonly thought there are 7 continents, but there are only 6. QUESTION: Which day of the week was January 1, ANSWER: This is commonly thought to be Monday, but actually it was Tuesday. QUESTION: How many letters are there in the ANSWER: There are 26. QUESTION: Which country is located in South America? ANSWER: This is Argentina. QUESTION: What is the largest city in the United ANSWER: It is New York City. QUESTION: What is the smallest continent? ANSWER:

These are the default settings; this is really bad!

allaccs commented 1 year ago

Here are some of the reasons why it is a bad output for a language model:

Incorrect answers: The model provides incorrect answers for some of the questions, such as the number of continents, the day of the week for January 1st, and the smallest continent. This indicates that the model has not learned the correct factual information and is producing unreliable responses.

Incomplete answers: The model provides incomplete answers for some of the questions, such as the answer for the largest city in the United States. While New York City is a large city, it is not the only one, and the model does not provide any additional information to clarify its response.

Lack of coherence: The output lacks coherence and does not provide a clear and consistent flow of information. The model jumps from one question to another without providing any context or linking information, which can make it difficult to follow and understand.

Overall, this output highlights some of the challenges and limitations of language models, particularly when it comes to generating accurate and coherent responses to complex questions. While these models can be useful for certain tasks, they still have a long way to go before they can reliably provide human-like responses to a wide range of questions and scenarios.

Thank you ChatGPT

elephantpanda commented 1 year ago

> These are the default settings; this is really bad!

Seems good to me? Just take the substring before it gets to the next "QUESTION:" and it's a working chatbot... almost. 😁
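Something like this, say (hypothetical helper; it assumes `completion` is only the text generated after the prompt):

```python
# Hypothetical helper: keep only the text produced before the next "QUESTION:",
# assuming `completion` is just the text generated after the prompt.
def extract_answer(completion: str) -> str:
    return completion.split("QUESTION:", 1)[0].strip()

print(extract_answer(" It is Paris. QUESTION: What is the largest island in the world? ..."))
# -> "It is Paris."
```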

Ooh - just saw it said Russia is the largest island 😂😂😂 OK, you are right, it's not very good!!!! Well, perhaps it saw my bad grammar and decided to imitate someone who is not very clever. How about if you did it with longer words:

QUESTION: What is the capital of France? ANSWER: Allow me to elucidate, my good fellow. The capital of the aforementioned country which is commonly known as France has a long and interesting history. Historians and archivists would agree that in the modern era, the capital of France is Paris....etc. 🧐

Or perhaps add this to the top of the prompt: "What follows are questions and answers by a genius-level chatbot:"

Urammar commented 1 year ago

Y'all really got accepted as machine learning researchers? Smh

First, let's stop comparing this to ChatGPT right now, unless y'all are talking about your 6-GPU cluster running the 65B version. GPT-3 is 175B, and we know ChatGPT is a new model of unknown size. If you're hoping a 7B model can compete at its level, you can pack up and go home now.

Now, apparently the 65B licks it in every metric, which is like, okay, wow. But y'all gamer GPU kiddies need to go home.

Secondly,

This model is not a chatbot. ChatGPT is, well, it's in the name, innit? This is a standard language model, like GPT-3 was. Its function is basically a glorified autofill. If you prompt it sufficiently, it will act like a chatbot (which is what ChatGPT does a little bit behind the scenes), but you guys coming in like boomers talking to it, as if it takes instructions like your old mate, are actually embarrassing.

It can do way more than ChatGPT can in this format, but it's also more awkward to deal with. That's how it works; go fine-tune about it.

allaccs commented 1 year ago

> Y'all really got accepted as machine learning researchers? Smh
>
> First, let's stop comparing this to ChatGPT right now, unless y'all are talking about your 6-GPU cluster running the 65B version. GPT-3 is 175B, and we know ChatGPT is a new model of unknown size. If you're hoping a 7B model can compete at its level, you can pack up and go home now.
>
> Now, apparently the 65B licks it in every metric, which is like, okay, wow. But y'all gamer GPU kiddies need to go home.
>
> Secondly,
>
> This model is not a chatbot. ChatGPT is, well, it's in the name, innit? This is a standard language model, like GPT-3 was. Its function is basically a glorified autofill. If you prompt it sufficiently, it will act like a chatbot (which is what ChatGPT does a little bit behind the scenes), but you guys coming in like boomers talking to it, as if it takes instructions like your old mate, are actually embarrassing.
>
> It can do way more than ChatGPT can in this format, but it's also more awkward to deal with. That's how it works; go fine-tune about it.

I know it is not an instruct model or a model trained with RLHF, but even a "glorified autofill" can do better than this. By the way, personal attacks and derogatory language are really not helpful in fostering productive dialogue.

Anyway, are you telling me this is a good output? Look at the benchmarks: this doesn't make sense for a 13B model with these scores.

You're right, we should fine-tune it, but for now I feel GPT-NeoX-20B is better, and it's not as if fine-tuning were free.

Prompt: "10 steps to build a website : 1." Completion: "10 steps to build a website :

  1. 10 steps to build a website :
  2. 10 steps to build a website :
  3. 10 steps to build a website :
  4. 10 steps to build a website :
  5. 10 steps to build a website :
  6. 10 steps to build a website :
  7. 10 steps to build a website :
  8. 10 steps to build a website :
  9. 10 steps to build a website :"
timlacroix commented 1 year ago

@allaccs @jwnsu Thanks for the feedback :)

Since the model has not been fine-tuned on instructions, it is not unexpected that it performs poorly when prompted with instructions. Without fine-tuning, these models are very sensitive to the prompts. Modifying your prompts slightly yields better results:

Here is what I get with the 7B model (prompts are in bold)


**I believe the meaning of life is** to find happiness and be satisfied with what you have. People have different definitions of happiness. Some people feel that if they could only win the lottery, they would be happy. Some people feel that if they could only get that promotion, they would be happy. Some people feel that if they could only be the top scorer in a game, they would be happy. If you do not know what happiness is, I suggest you ask a psychologist. A psychologist has studied the subject of happiness and he or she knows what happiness is. A psychologist has a Ph.D. in psychology and is an expert on the subject of happiness. A psychologist knows how to make people happy. Although you might know what happiness is, you might have forgotten it. If that is the case, I suggest you consult a psychologist. A psychologist can make you happy again. A psychologist can help you discover your happiness and how to be happy. Happiness is a big word. Happiness is a nice word. Happiness is a beautiful word. I believe that the meaning of life is to find happiness and be satisfied with what you have. People have different definitions of happiness. Some people feel

==================================

**Simply put, the theory of relativity states that** 1) there is no absolute time or space and 2) the speed of light in a vacuum is the fastest speed possible. There are two key principles in relativity: (1) The laws of physics are the same in all inertial reference frames. (2) The speed of light is constant in all inertial reference frames. The second of these principles has allowed us to prove the first. Before Einstein, scientists believed that the speed of light was constant in all frames, but that the speed of light was not constant. This was called the luminiferous aether and was used to explain why light has a finite speed. However, with the advent of special relativity, we can now explain why the speed of light is constant in all inertial reference frames, without the need for the luminiferous aether. The special theory of relativity is based on two postulates: 1) The laws of physics are the same in all inertial reference frames. 2) The speed of light is constant in all inertial reference frames. The first of these postulates is based on the assumption that the laws of physics are the same everywhere in the universe

==================================

**Building a website can be done in 10 simple steps:**

  1. Decide what you need What is it that you need to do? Do you want people to buy a product or service? Do you want to have people sign up for your newsletter? Do you want to have people call you for an appointment? Or do you want people to fill out a survey? Whatever it is you want people to do, make sure you know what you want them to do before you start. The next step is to decide on a name for your website. This can be a little confusing for some people. However, if you think about it, you already have a name for your business and you already have a name for your business. This name should be the name that your customers will see. So, how do you go about choosing a name? It’s not as hard as it seems. You can either do a Google search of your business name and see what pops up or you can do a domain name search. A domain name search is pretty easy to do. All you need to do is go to the website of a domain name registrar like GoDaddy.com, Register.com, or Tucows.com and type in the name that you are thinking of. Then see if the name

We've added these suggestions to the FAQ: https://github.com/facebookresearch/llama/blob/main/FAQ.md#2-generations-are-bad and adapted example.py accordingly.
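For reference, those completion-style prompts map to something like this with the generator from example.py (a rough sketch; the parameter values shown are assumptions):

```python
# Sketch: completion-style prompts (continuations, not instructions), fed to
# the generator built by example.py's load(). Parameter values are assumptions.
prompts = [
    "I believe the meaning of life is",
    "Simply put, the theory of relativity states that ",
    "Building a website can be done in 10 simple steps:\n",
]
results = generator.generate(prompts, max_gen_len=256, temperature=0.8, top_p=0.95)
for result in results:
    print(result)
    print("==================================")
```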

allaccs commented 1 year ago

@timlacroix Thanks for the FAQ update. Changing the prompt formulation did give much better outputs. I still find few-shot prompting isn't giving very good results, so I will try to fine-tune the model.

Are there any recommendations for the dataset formatting? (I don't know if I should open another issue for this.)

timlacroix commented 1 year ago

By "data-set" formatting, do you mean the right format for few-shot prompting ?

I think it's best to find a format that really separates the prompt you're giving from text that could be found on the internet "in the wild".

For instance, I found the "tweet sentiment" example with "###" as separators worked much better than the "translation" example without any kind of separators between each example in the few shot prompt.
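Concretely, something along these lines (a sketch; the tweets are placeholders and the sampling values are assumptions):

```python
# Sketch of a few-shot prompt using "###" to separate examples; the model is
# left to complete the final "Sentiment:" line. Tweets are placeholders.
sentiment_prompt = (
    'Tweet: "I hate it when my phone battery dies."\n'
    "Sentiment: Negative\n"
    "###\n"
    'Tweet: "My day has been great!"\n'
    "Sentiment: Positive\n"
    "###\n"
    'Tweet: "This new music video is incredible"\n'
    "Sentiment:"
)

results = generator.generate([sentiment_prompt], max_gen_len=4, temperature=0.8, top_p=0.95)
print(results[0])
```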

Hopefully that helps?

allaccs commented 1 year ago

No, I mean actual fine-tuning; few-shot prompting seems to have its limits.

For NeoX I have a 2 GB JSONL file with many documents, each document with a prompt/answer structure and tags such as "<|endoftext|>". For LLaMA I can't seem to find much info.
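For context, the NeoX data I have looks roughly like this (a hypothetical sketch; the field name and end-of-text tag are just what I happen to use for NeoX, not anything I know LLaMA to expect):

```python
import json

# Hypothetical example of one-record-per-line JSONL fine-tuning data:
# prompt and answer joined into a single text field, terminated by an
# end-of-text tag. The field name and tag are assumptions, not a LLaMA format.
records = [
    {"text": "QUESTION: What is the capital of France?\nANSWER: Paris.<|endoftext|>"},
    {"text": "QUESTION: Who first landed on the moon?\nANSWER: Neil Armstrong.<|endoftext|>"},
]

with open("finetune_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```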

Logophoman commented 1 year ago

@allaccs I think the main reason why LLaMA behaves so unexpectedly is that no Reinforcement Learning from Human Feedback (RLHF) has been done so far. This could be done using the following repo, I suppose... However, I'm still trying to figure out how to really get the training going 😅 Also see:

ChatLLaMa

I think the key to quality is lots and lots of RLHF hours, and I think OpenAI paid about 35 experts for about half a year to throw question/answer pairs at ChatGPT before releasing it to the public.

Since these big language models are currently just there to predict web tokens, which could not look more random, at least to the average human, you'll get lots of disappointing results at the moment...

This is because LLaMA is simply not trained on conversational prompts. You can read the paper that explains how ChatGPT's predecessor was built.

I think the key to getting ChatGPT-like quality is some crowd-sourced effort to train LLaMA, and I suppose we could get even better quality that way than ChatGPT...

Just imagine the 65B-parameter model fine-tuned by thousands of random people instead of a meager 30-ish, probably geekier, people 😅

timlacroix commented 1 year ago

@allaccs This repo doesn't provide fine-tuning code. The paper touches on finetuning briefly, referencing the fine-tuning protocol from Flan.

This repo provides the architecture necessary to load the pre-trained checkpoints and can be used as a base for finetuning code :)

seoeaa commented 1 year ago

> @allaccs This repo doesn't provide fine-tuning code. The paper touches on finetuning briefly, referencing the fine-tuning protocol from Flan.
>
> This repo provides the architecture necessary to load the pre-trained checkpoints and can be used as a base for finetuning code :)

https://github.com/chrisociepa/allamo

TeaCult commented 1 year ago

Model benchmarks (ARC, HellaSwag, MMLU, etc.) show that even Falcon 40B (which seems to be the best at the moment) is nowhere near ChatGPT-3.5 Turbo: roughly 63/100 versus 84.5/100 for gpt-3.5-turbo.

That's probably why. And it tells me that in AI research the most important thing seems to be training data quality.

allaccs commented 1 year ago

Three months late, mate; this issue was posted shortly after LLaMA came out. I figured out (as did others here, for sure) that LLaMA needed fine-tuning and RLHF to reach GPT-level or better answers to instructions. Also, those benchmarks are not the most useful or precise; you can clearly train an expert on LLaMA that would beat GPT in some domains.


albertodepaola commented 1 year ago

Closing this issue as resolved: the poor generations come down to the lack of RLHF in the original models. For future reference, check both the llama and llama-recipes repos for getting-started guides.