langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Why is ollama running slowly? #20621

Closed eyurtsev closed 1 month ago

eyurtsev commented 5 months ago

We need to investigate whether we have an issue with the Ollama integration, and if so, why.

Discussed in https://github.com/langchain-ai/langchain/discussions/18515

Originally posted by **gosforth** March 4, 2024

I'm playing with LangChain and Ollama. My source text is a 90-line poem (each line max 50 characters). First I load it into a vector DB (Chroma):

```
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import CharacterTextSplitter

# load the document and split it into chunks
loader = TextLoader("c:/test/some_source.txt", encoding="utf8")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=2500, chunk_overlap=0, separator=".")
docs = text_splitter.split_documents(documents)

# Create Ollama embeddings and vector store
embeddings = OllamaEmbeddings(model="mistral")

# load it into Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory="c:/test/Ollama/RAG/data")

# save db
db.persist()
```

Execution time is about 25 seconds. Why so long?! For instance, generating embeddings with SBERT is way shorter.

Then I use these vectors with an Ollama model:

```
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# reset DB variable
db = None

embeddings = OllamaEmbeddings(model="mistral")

# read from Chroma
db = Chroma(persist_directory="c:/test/Ollama/RAG/data", embedding_function=embeddings)

llm = Ollama(base_url='http://localhost:11434', model="mistral", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
)

question = "Here comes the question text?"
result = qa_chain.invoke({"query": question})
print(result["result"])

# delete collection
db.delete_collection()
```

Execution time is... 26 seconds. A huge amount of time for such a short text.

My hardware: Ryzen 7 5700X, 48 GB RAM, GTX 1050 Ti.

I tried different settings for chunk size and separator; the differences are trivial. Is there any trick to speed it up? GPU load is at most 50%, CPU similar, RAM practically not used. Something wrong with the code? Any suggestion appreciated. Best
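One way to narrow down where those 25 seconds go is to time the Ollama embedding call on its own, separately from the Chroma write. A minimal sketch (assuming the `mistral` model is already pulled and the Ollama server is running locally; the sample texts are placeholders):

```
import time

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="mistral")  # assumes `ollama pull mistral` has been run

# placeholder chunks standing in for the real document splits
texts = ["chunk %d of the poem" % i for i in range(3)]

start = time.time()
vectors = embeddings.embed_documents(texts)  # each chunk is sent to the local Ollama server
print("embedded %d chunks in %.1f s" % (len(vectors), time.time() - start))
```

If most of the time is already spent here, the Chroma side is not the bottleneck.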
liugddx commented 5 months ago

Is it this slow for everyone, or does it have something to do with hardware configuration?

msmmpts commented 5 months ago

Hi,

I ran the code shared below.

```
from langchain_community.llms import Ollama
import time

llm = Ollama(base_url='http://localhost:11434', model="llama3:instruct", temperature=0)
start_time = time.time()
response = llm.invoke("Tell me a joke")
print("--- %s seconds ---" % (time.time() - start_time))
print(response)
```
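For what it's worth, streaming the response separates model-load / time-to-first-token from total generation time; a rough sketch along the same lines (same model and server assumed):

```
import time

from langchain_community.llms import Ollama

llm = Ollama(base_url="http://localhost:11434", model="llama3:instruct", temperature=0)

start = time.time()
first_token_at = None
for chunk in llm.stream("Tell me a joke"):
    if first_token_at is None:
        first_token_at = time.time() - start  # includes model load on a cold start
    print(chunk, end="", flush=True)
print("\n--- first token after %.1f s, total %.1f s ---" % (first_token_at, time.time() - start))
```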

I also noticed the following:

  1. Execution time was 69 seconds, which is very slow.
  2. The output does not stop; it keeps generating text, as shown in the screenshots below.
(Two screenshots from 2024-04-22 showing the output still being generated.)

Any comments / thoughts on fixing these issues?

Here are my installed LangChain libraries for reference.
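For anyone wanting to compare versions, a small stdlib-only snippet that prints them (the package names below are just the usual suspects; adjust to whatever is actually installed):

```
from importlib.metadata import PackageNotFoundError, version

for pkg in ["langchain", "langchain-core", "langchain-community", "langchain-text-splitters"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```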

Bovey0809 commented 5 months ago

I came across the same non-stop generation issue with llama3 on Ollama.

SinghJivjot commented 5 months ago

Try updating your packages. I updated all the LangChain packages after llama3 was released and that fixed this issue. But after that I am getting an Ollama 400 error. Let me know if you face the same issue.

ErikValle2 commented 5 months ago

I had the same issue using GraphCypherQAChain; it keeps talking to itself forever. These are the libraries and the system info:

langchain==0.1.16
langchain-cli==0.0.21
langchain-community==0.0.33
langchain-core==0.1.43
langchain-openai==0.0.8
langchain-text-splitters==0.0.1

Platform: Linux
Python version: 3.11.7

It does generate the Cypher query and reply to my question, but then it keeps running in a loop and adds markup in JSON format:

```
I think we're all set!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Let's wrap up!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

All done!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

That's it!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It looks like I'm just waiting for your confirmation before considering our task complete!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Task complete!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I think we've finished the answer generation!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Yay!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

...
```
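The `<|eot_id|>` markers above are llama3's end-of-turn tokens leaking into the output, which is why generation never stops. Upgrading the packages (as suggested in the next comment) is the proper fix; as a stopgap, passing the token as an explicit stop sequence should also work. A sketch, assuming the `Ollama` wrapper forwards `stop` through to the server:

```
from langchain_community.llms import Ollama

# Assumption: the wrapper passes `stop` on to the Ollama API; upgrading
# langchain-core / langchain-community is the cleaner fix.
llm = Ollama(
    base_url="http://localhost:11434",
    model="llama3",
    temperature=0,
    stop=["<|eot_id|>"],
)

print(llm.invoke("Tell me a joke"))
```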
SinghJivjot commented 5 months ago

@ErikValle2 If it's not fixed yet, update langchain_core. I had the same issue; upgrading the libraries solved it. Here are my versions:

System Information

OS: Linux
OS Version: #1 SMP Thu Jan 11 04:09:03 UTC 2024
Python Version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]

Package Information

langchain_core: 0.1.45
langchain: 0.1.16
langchain_community: 0.0.34
langsmith: 0.1.49
langchain_chroma: 0.1.0
langchain_cli: 0.0.21
langchain_experimental: 0.0.53
langchain_groq: 0.1.2
langchain_nomic: 0.0.2
langchain_pinecone: 0.0.3
langchain_text_splitters: 0.0.1
langchainhub: 0.1.15
langgraph: 0.0.38
langserve: 0.1.0

ErikValle2 commented 5 months ago

@SinghJivjot it works! Thank you

qsdhj commented 4 months ago

Hi, I think I have the same problem, with Ollama 0.1.33 on Windows.

Python 3.12.1
langchain-core 0.1.48
langchain 0.1.17
langchain-community 0.0.36
langchain-chroma 0.1.0
langchain-openai 0.1.5
langchain-text-splitters 0.0.1
langchainhub 0.1.15
langsmith 0.1.52

My problem is that Ollama is really slow, taking approximately 10 times as long as last week when used as the LLM in my RAG chain (300 s instead of 30 s):

```
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# retriever, format_docs, prompt and llm are defined earlier in my script
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```
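To see whether the slowdown is in retrieval or in the LLM call, the two stages can be timed on their own; a rough sketch, assuming `retriever` and `llm` are the same objects used in the chain above (the standalone `llm.invoke` call skips the retrieved context, so it only approximates the chain's generation step):

```
import time

question = "your question here"  # placeholder

start = time.time()
docs = retriever.invoke(question)  # query embedding + vector search
print("retrieval: %.1f s, %d docs" % (time.time() - start, len(docs)))

start = time.time()
answer = llm.invoke(question)  # generation without the retrieved context
print("llm: %.1f s" % (time.time() - start))
```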

Invoking just the LLM works fine today. Yesterday it also took 30-60 seconds instead of 3-4:

```
from langchain_community.llms import Ollama

llm = Ollama(model="llama3",
             temperature=0.0,
             )

llm.invoke("Tell me a joke")
```

I am unsure whether this is a LangChain problem, an Ollama problem, or something on my end. The problem appeared on 03.05; that day I updated Ollama and reinstalled torch with CUDA. Now I seem to have two different CUDA versions installed.
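As far as I know, Ollama ships its own GPU runtime and does not use the Python CUDA install, so the torch reinstall mainly matters for other code; still, a quick way to check which CUDA build PyTorch sees (assuming torch is importable):

```
import torch

print(torch.__version__)          # e.g. 2.x.y+cu121 — the CUDA tag of the installed wheel
print(torch.version.cuda)         # CUDA version this torch build was compiled against
print(torch.cuda.is_available())  # whether torch can see the GPU at all
```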
