jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0

How do you plan on dealing with hallucinations due to knowledge compression? #10

Open VatsaDev opened 1 year ago

VatsaDev commented 1 year ago

Hi, I'm very interested in this project, but I'd like to know how you plan to deal with the hallucinations that come from such a high compression ratio, i.e. the ratio of training tokens to model parameters. 3T tokens into 1.1B parameters is far heavier compression than Llama 2's 2T tokens into 7B parameters.
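For reference, the ratios I'm comparing work out roughly like this (back-of-the-envelope arithmetic only):

```python
# Rough tokens-per-parameter comparison for the two setups mentioned above.
tinyllama_tokens, tinyllama_params = 3e12, 1.1e9   # 3T tokens into 1.1B params
llama2_tokens, llama2_params = 2e12, 7e9           # 2T tokens into 7B params

print(f"TinyLlama:  ~{tinyllama_tokens / tinyllama_params:.0f} tokens per parameter")  # ~2727
print(f"Llama 2 7B: ~{llama2_tokens / llama2_params:.0f} tokens per parameter")        # ~286
```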

jzhang38 commented 1 year ago

Exploring retrieval augmented generation is on our TODO list!

VatsaDev commented 1 year ago

RAG would definitely help, but have you considered training the model on data similar to the SQuAD dataset, so it gets familiar with pulling factual answers out of a provided context and is better suited for RAG?
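To be concrete about the RAG side, the loop I'm picturing is roughly the sketch below. The word-overlap retrieval is just a toy stand-in; a real setup would use embeddings and a vector index.

```python
# Minimal sketch of the RAG loop: retrieve the most relevant passage, then
# prepend it to the prompt so the model answers from context rather than from
# compressed parametric knowledge. Word-overlap scoring is a toy stand-in for
# a proper vector index.

def retrieve(passages: list[str], question: str) -> str:
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

def build_prompt(passages: list[str], question: str) -> str:
    context = retrieve(passages, question)
    return f"context: {context}\nquestion: {question}\nanswer:"

passages = [
    "The largest known dinosaur eggs are those of Hypselosaurus priscus.",
    "TinyLlama is a 1.1B parameter model pretrained on 3 trillion tokens.",
]
print(build_prompt(passages, "What is the biggest dinosaur egg ever found?"))
```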

jzhang38 commented 1 year ago

Yes, we are currently reading papers on retrieval-augmented LMs to find out which training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or two if you have any ideas.

VatsaDev commented 1 year ago

RAG involves pulling text from documents or vector embeddings, which is great, but it won't work well with the base text-generation model as it is right now. When you make an official finetune, you'll presumably release a TinyLlama-chat version, and there you could mix in training data like squad_v2, so you could train it on chat data like:

question: What is the biggest dinosaur egg ever found?
context: The largest known dinosaur eggs are those of Hypselosaurus priscus ('high ridge lizard'), a 12 m (40 ft) long titanosaurid which lived about 80 million years ago.
answer: The largest known dinosaur eggs are those of Hypselosaurus priscus
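Concretely, flattening squad_v2-style records into that chat format could look roughly like the sketch below. The field layout follows the Hugging Face squad_v2 schema, but the prompt template itself is only illustrative, not an official TinyLlama format.

```python
# Sketch: turn SQuAD-v2-style records into context-grounded chat examples.
# The prompt template below is illustrative, not TinyLlama's official format.

def to_chat_example(record: dict) -> str:
    answers = record["answers"]["text"]
    # squad_v2 includes unanswerable questions; teach the model to say so.
    answer = answers[0] if answers else "The context does not contain the answer."
    return (
        f"question: {record['question']}\n"
        f"context: {record['context']}\n"
        f"answer: {answer}"
    )

record = {
    "question": "What is the biggest dinosaur egg ever found?",
    "context": "The largest known dinosaur eggs are those of Hypselosaurus priscus...",
    "answers": {"text": ["The largest known dinosaur eggs are those of Hypselosaurus priscus"]},
}
print(to_chat_example(record))
```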

walking-octopus commented 12 months ago

Perhaps something like Toolformer, with special tokens for intermediate tool use and its output, may be feasible.
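A rough sketch of what I mean; the token names and the executor loop here are made up purely for illustration, not Toolformer's actual implementation.

```python
# Illustrative Toolformer-style special tokens (token names invented here): the
# model emits <tool>...</tool>, an external executor runs the call, and the
# result is spliced back in before generation continues.
import re

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}  # toy tool, unsafe outside a demo

def run_tools(generated_text: str) -> str:
    def replace(match):
        name, arg = match.group(1), match.group(2)
        return f"<result>{TOOLS[name](arg)}</result>"
    return re.sub(r"<tool>(\w+)\((.*?)\)</tool>", replace, generated_text)

draft = "The answer is <tool>calculator(12 * 7)</tool>, so 84 eggs in total."
print(run_tools(draft))
# The answer is <result>84</result>, so 84 eggs in total.
```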

VatsaDev commented 12 months ago

@walking-octopus Toolformer in the way you suggest might work, but what do you mean by special tokens?

The steps are

artnoage commented 11 months ago

@VatsaDev Can you please give some references for your expectation that more data means more hallucination? I understand there are heuristics (the Chinchilla paper) about the right amount of data needed to train an LLM of a given size, but why are you so sure they hold as more than just heuristics?
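For context, the Chinchilla rule of thumb (roughly 20 training tokens per parameter, from Hoffmann et al. 2022) works out like this for a 1.1B model. Note this is only a compute-optimality heuristic, not a claim about hallucination.

```python
# Rough comparison against the Chinchilla compute-optimal heuristic (~20 tokens/param).
chinchilla_ratio = 20                     # approximate rule of thumb, Hoffmann et al. 2022
tinyllama_params = 1.1e9
compute_optimal_tokens = chinchilla_ratio * tinyllama_params

print(f"Compute-optimal budget: ~{compute_optimal_tokens / 1e9:.0f}B tokens")        # ~22B
print(f"Planned budget: 3000B tokens (~{3e12 / compute_optimal_tokens:.0f}x over)")   # ~136x
```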

VatsaDev commented 11 months ago

@artnoage I read a paper on arXiv but unfortunately can't find the link. Sorry if I came across as certain; I was treating it the way people treat the Chinchilla paper. I also phrased the question that way a couple of weeks ago, when saturation seemed more likely than it does now.

xiaoyunwu commented 11 months ago

> Yes, we are currently reading papers on retrieval-augmented LMs to find out which training/adaptation setup for RAG is best suited for TinyLlama. It would be great if you could provide a pointer or two if you have any ideas.

I think the main thing is instruction tuning first, and maybe adding an encoding for multi-turn conversations.
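For the multi-turn encoding, something along these lines; the role markers are placeholders, not an official TinyLlama chat template.

```python
# Sketch of a multi-turn chat encoding. The role markers below are illustrative
# placeholders, not an official TinyLlama chat template.

def encode_conversation(turns: list[dict]) -> str:
    """turns: [{"role": "user" or "assistant", "content": "..."}]"""
    parts = [f"<|{t['role']}|>\n{t['content']}</s>" for t in turns]
    return "\n".join(parts) + "\n<|assistant|>\n"   # cue the model to reply next

print(encode_conversation([
    {"role": "user", "content": "What is the biggest dinosaur egg ever found?"},
    {"role": "assistant", "content": "Those of Hypselosaurus priscus."},
    {"role": "user", "content": "How long ago did it live?"},
]))
```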

xiaoyunwu commented 11 months ago

https://github.com/yaodongC/awesome-instruction-dataset @jzhang38 Just in case you did not see this.

VatsaDev commented 11 months ago

@xiaoyunwu, Instruction tuning seems like a good idea, but one of TinyLlama's main features is its context size, which I believe is 2048 tokens. That probably makes the model a good fit for few-shot/multi-shot prompting rather than zero-shot, maybe even 32-shot, like a mini GPT-3. Do you know of any good datasets for this?
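Roughly what I have in mind for packing shots into the window is sketched below; the token counting is a crude whitespace approximation, and a real setup would use the model's actual tokenizer.

```python
# Sketch: pack as many few-shot examples as will fit in a 2048-token window.
# Token counting here is a crude whitespace approximation; a real setup would
# count tokens with the model's tokenizer instead.

CONTEXT_LIMIT = 2048

def build_few_shot_prompt(examples: list[str], query: str,
                          reserved_for_answer: int = 256) -> str:
    budget = CONTEXT_LIMIT - reserved_for_answer - len(query.split())
    parts, used = [], 0
    for ex in examples:
        cost = len(ex.split())          # rough stand-in for a token count
        if used + cost > budget:
            break                       # stop before overflowing the context
        parts.append(ex)
        used += cost
    parts.append(query)
    return "\n\n".join(parts)

shots = [f"Q: {i} + {i} = ?\nA: {2 * i}" for i in range(32)]   # 32-shot style
print(build_few_shot_prompt(shots, "Q: 21 + 21 = ?\nA:"))
```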

xiaoyunwu commented 11 months ago

Instruction tuning is not zero-shot (prompt engineering can be).

VatsaDev commented 11 months ago

@xiaoyunwu Looking at the dataset, I see that it's there.

Luoyingfeng8 commented 10 months ago

> @VatsaDev Can you please give some references for your expectation that more data means more hallucination? I understand there are heuristics (the Chinchilla paper) about the right amount of data needed to train an LLM of a given size, but why are you so sure they hold as more than just heuristics?

I also have the same doubt

VatsaDev commented 10 months ago

@Luoyingfeng8 I already responded to this for @artnoage. I made that claim several months ago, and since then I've seen several cases where training on more tokens produced better models.

chadbrewbaker commented 9 months ago

You need to release a suffix array of the training corpus to do it properly. This is also useful in designing hypothetical copyright filters.
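A toy sketch of the idea: a naive suffix array over a tiny corpus, used to check whether a generated span occurs verbatim in the training text. A real corpus would need proper suffix-array tooling (e.g. Google's deduplicate-text-datasets) rather than this O(n² log n) construction, and the bisect key argument requires Python 3.10+.

```python
# Minimal sketch: naive suffix array over a tiny corpus, then a binary search to
# test whether a generated span appears verbatim in the training text.
import bisect

def build_suffix_array(text: str) -> list[int]:
    # O(n^2 log n) toy construction; fine for a demo, not for a 3T-token corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text: str, sa: list[int], query: str) -> bool:
    # Binary search for any suffix that starts with `query` (Python 3.10+ for key=).
    lo = bisect.bisect_left(sa, query, key=lambda i: text[i:i + len(query)])
    return lo < len(sa) and text[sa[lo]:sa[lo] + len(query)] == query

corpus = "The largest known dinosaur eggs are those of Hypselosaurus priscus."
sa = build_suffix_array(corpus)
print(contains(corpus, sa, "dinosaur eggs"))   # True
print(contains(corpus, sa, "chicken eggs"))    # False
```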