meta-llama / llama

Inference code for Llama models

Fine tuning on a collection of text documents #730

Open mathav95raj opened 1 year ago

mathav95raj commented 1 year ago

Is it right to expect Llama to be fine-tuned on knowledge in the form of unstructured text from a proprietary document (chunking the text into fixed lengths with the LangChain text splitter and fine-tuning on the chunks)? I understand that embedding-based similarity search can retrieve relevant passages, but would it be possible for Llama to absorb the knowledge from a new document it has never seen? My question is similar to this

I kept the instruction column fixed with just a simple statement ('This is a useful information from ') and the outputs are the chunks, one per row of the dataset. The expectation is that when a question is asked about the PDF, Llama should answer it. In my experiment, Llama answers better than before fine-tuning, but it still hallucinates a lot. Is my approach even valid? If it is, what would be the right way to prepare data for fine-tuning?
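The data layout described above can be sketched as follows. This is a minimal illustration, not the thread author's actual script: the fixed-length character chunking stands in for LangChain's text splitter, and the document text and source name are made up.

```python
import textwrap

def chunk_text(text, chunk_size=512):
    """Split a document into fixed-length character chunks
    (a rough stand-in for LangChain's CharacterTextSplitter)."""
    return textwrap.wrap(text, chunk_size,
                         break_long_words=False, replace_whitespace=False)

def build_rows(document, source_name):
    """Build instruction-tuning rows as described in the thread:
    a fixed instruction, and one chunk per row as the output."""
    return [
        {
            "instruction": f"This is a useful information from {source_name}",
            "input": "",
            "output": chunk,
        }
        for chunk in chunk_text(document)
    ]

# Hypothetical document and filename for illustration only.
doc = "Llama is a family of large language models. " * 40
rows = build_rows(doc, "example.pdf")
print(len(rows), "rows;", "first output length:", len(rows[0]["output"]))
```

Each row can then be serialized to JSONL and fed to the usual instruction-tuning recipes.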

amitsangani commented 1 year ago

Yes, you should be able to vectorize and create embeddings for the external data and store them in a similarity search engine (e.g. FAISS). You then pass the input prompt, along with additional context retrieved from the vector DB, as input to Llama. Again, there is no guarantee that it will not hallucinate, but you will be able to get relevant responses. And yes, LangChain makes this easier.
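The retrieve-then-prompt flow suggested above can be sketched as below. This is a toy, dependency-free illustration: the word-overlap similarity stands in for a real embedding model plus a FAISS index, and the chunk texts and helper names are invented for the example.

```python
import re

def score(query, chunk):
    """Toy similarity via word overlap (Jaccard). A real pipeline would
    embed both texts and do cosine/inner-product search in FAISS."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q | c)

def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

# Hypothetical chunks, as if produced by a text splitter.
chunks = [
    "Llama 2 was released by Meta in July 2023.",
    "FAISS is a library for efficient similarity search.",
    "The capital of France is Paris.",
]
question = "When was Llama 2 released?"
context = "\n".join(retrieve(question, chunks))

# Retrieved context is prepended to the user's question before
# handing the prompt to Llama.
prompt = (
    "Use only the context below to answer.\n"
    f"Context:\n{context}\n"
    f"Question: {question}"
)
print(prompt)
```

Swapping in real embeddings and a FAISS index changes only `score`/`retrieve`; the prompt assembly step stays the same.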

mathav95raj commented 1 year ago

Thanks for the response @amitsangani

Would it be possible to do this without similarity search, relying only on the fine-tuned LLM?

> I understand that there are embedding based similarity search to retrieve relevant responses, but would it be possible for llama to absorb the knowledge from a new document that it has never seen? My question is similar to this

amitsangani commented 1 year ago

Yes, you should be able to do this with a fine-tuned model as well. The issue is that fine-tuning a model is usually time- and cost-intensive. If your data is constantly changing, you can use a combination of fine-tuning and prompt engineering: fine-tune on your domain-specific data once, and feed the most recent data to Llama as context retrieved from a vector DB at prompt time.