Using llama for ConversationalRetrievalChain

tgcandido commented 1 year ago

I'm currently using OpenAIEmbeddings and OpenAI LLMs for ConversationalRetrievalChain. I'm trying to switch to LLAMA (specifically Vicuna 13B but it's really slow. I've done this:

embeddings = LlamaCppEmbeddings(model_path="/Users/tgcandido/dalai/llama/models/7B/ggml-model-q4_0.bin") llm = LlamaCpp(model_path="/Users/tgcandido/dalai/alpaca/models/7B/ggml-model-q4_0.bin")

I could use different embeddings (OpenAIEmbeddings + LlammaCpp ?), but I don't know if that's a good match - I don't know much about embedding compatibility.

Another idea is to use a LlamaCpp model in a "REST" mode that is loaded once, and I can send many requests because for each prompt, the executable is run, and It takes ~10s to load the executable on my M2 Max.

Are my hypothesis correct, or am I missing something here?

digitake commented 1 year ago

So what do you really want to achieve here?

Speed - OpenAI might offers the best speed performance compared to Vicuna model which is run locally.
Price/Data Privacy - Vicuna is good.

Embedding can be done once and save into file. You can use Chroma to create persistence Vectorstore, load it back up on running. Re-indexing them once in a while.

tgcandido commented 1 year ago

Got it.

As for my goal, I want to be free from OpenAI.

I can run and persist the embeddings, but I'm not sure if I need to use the llama model for creating embeddings(which is slow - at least for the embeddings that I'm trying to generate).

As for inferencing, it seems like the llama.so file is opened for every prompt, and just for the executable to start takes around ~10s. I think I want to achieve a one-time initialization of llama that can serve multiple prompts.

Qualzz commented 1 year ago

i've tried huggingface embeddings + llama and it's promising !

tgcandido commented 1 year ago

@Qualzz which embeddings did you use?

digitake commented 1 year ago

Got it.

As for my goal, I want to be free from OpenAI.

I can run and persist the embeddings, but I'm not sure if I need to use the llama model for creating embeddings(which is slow - at least for the embeddings that I'm trying to generate).

As for inferencing, it seems like the llama.so file is opened for every prompt, and just for the executable to start takes around ~10s. I think I want to achieve a one-time initialization of llama that can serve multiple prompts.

LLaMa.cpp is slow because it is designed to be able to execute on CPU. Your option might be either:

Use OpenAI ada-002 to create embedding once(assume your data doesn't change), save to Vectorstore and use that.
Implement batch version of embedding(Use original LLAMA that run on GPU, put it on the cloud somewhere)

The embedding model and the LLM model doesn't necessary to be the same(to my knowledge). Because mostly we use embedding to transform [text -> vector(aka. list of number)]. To retrieve it back, yes, the same embedding model must be used to generate two vector and compare their similarity.

The LLM will be fed with the data retrieved from embedding step in the form of text. i.e. The LLM model contains its own embedding step internally.

sime2408 commented 1 year ago

@tgcandido Did you find a solution to this? I am also experiencing prolonged injection using LlamaCppEmbeddings. What are the alternatives to GPT models like gpt4all, vicuna, etc?

Free-Radical commented 1 year ago

@tgcandido @Qualzz @sime2408 @tgcandido I hate the concept of having to pay OpenAI or any ohter closed model. I am struggling with getting the basic llama 7b model utilizing llam.cpp to train for a small test document (i'm new to this :( ) , I read langchain info but am a little confused, could one of pls tell me or point me how to use langchain and a llama.cpp based model to load and query a small text document? A code example showing how would be greatly appreciated!

sime2408 commented 1 year ago

@Free-Radical not sure which model you use but according to what other guys already replied:

if you run locally then it's advisable to use Vicuna 7B, not Vicuna 13B.
use embeddings calculated somewhere on GPU like OpenAI does to speed things up, doesn't have to be OpenAI, hugging face is an option too. Save it to a file for later use. If it's taking too much time I guess you can chunk upsert it by just storing piece by piece of your txt chunked array: documents=txt_documents[0:32].
if you still want to tune it locally maybe you can play with num of threads, batch, etc, like this: LlamaCpp(model_path=GPT_MODEL_PATH, callback_manager=callback_manager, n_ctx = 1024, n_threads=64, n_batch=1024)
@tgcandido mentioned that llama.so file is opened for every prompt, I am not sure if this cache helps??
```
import langchain
```

from langchain.cache import InMemoryCache langchain.llm_cache = InMemoryCache()


- @digitake This statement is true: LLM model doesn't necessarily to be the same(to my knowledge)
- for document loading use one form `from langchain.document_loaders` package, then use `RecursiveCharacterTextSplitter` to chunk it into smaller pieces

Free-Radical commented 1 year ago

@sime2408 Thank you, i'll give it a shot. BTW is there a Discord,etc channel where people are doing the same thing? i.e. integrating FOSS, etc models for langchain/llamaindex full functionality? (dissimilar to what is avaialble for ChatGPT/OpenAI) stuff.

tgcandido commented 1 year ago

@sime2408 I haven't tested yet, but the answers on this issue gave me the following idea: I'll try to test using tiktoken/openAI for generating embeddings, chromadb with persist_directory argument to save the embeddings db to disk, and llama.cpp server - not the binary - for inference; I'll report back here when I finish the tests.

khimaros commented 1 year ago

@tgcandido what did you end up settling on? personally, i'm having trouble getting a reasonable similarity_search from any of the vectorstores when using the LlamaCppEmbeddings. i'm not as worried about performance, just want good quality results. how well is similarity_search working for your use case?

createchange commented 1 year ago

My quick two cents:

LlamaCppEmbeddings took forever to create embeddings for a modest sized dataset (e.g. 10 Word documents). This evening, I tried switching to the Sentence Transformers Embeddings (https://python.langchain.com/en/latest/modules/models/text_embedding/examples/sentence_transformers.html), which I found to generate embeddings nearly instantaneously. It was a great success. I stored these locally in a Chroma DB without issue.

Unfortunately, querying against the vectorstore with local Llama (13B - I need to try 7B) was absurdly slow. Like, hours and hours of waiting and yet no response was ever returned. This was on an M1 Pro 16 with 32GB RAM.

I loaded up OpenAI instead of Llama, and got responses back very quickly - 10 seconds or less.

So - my current take is:

use Sentence Transformers Embeddings rather than LlamaCppEmbeddings for local processing
get and use a GPU if you want to keep everything local, otherwise use a public API or "self-hosted" cloud infra for inference

Disclaimer: I am a noob. Take my 2 cents with a grain of salt.

Edit: @Free-Radical - I am not a big Discord user, but the communities I have found are all associated with various tooling. Chroma, Langchain, etc. all have their own communities. I suspect, Chroma being the type of project it is, you can find some like-minded folks there.

larawehbe commented 1 year ago

@createchange Thanks for your insights. I am trying to do the same with sentence transformers from huggingface, but the results are not really promising Im using Langchain for semantic search saving the vector embeddings and docs in elastic search engine

I tried using openai embeddings and the answers where on point I tried using Sentence transformers and the results aren't quite good, as if the semantic search engine with HF embeddings are not accurate and not "semantic"

Any recommendations on how to have a fully offline usecase ?

sime2408 commented 1 year ago

Any recommendations on how to have a fully offline usecase ?

@larawehbe Try with this repo: https://github.com/imartinez/privateGPT

langchain-ai / langchain

Using llama for ConversationalRetrievalChain #2784