langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Slow Embeddings With Ollama #21870

Open jkablan opened 2 months ago

jkablan commented 2 months ago

Checked other resources

Example Code


from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings.ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
import argparse
import ollama
import tqdm
print('done loading imports')

def main(args):

    # Get the directory path from arguments
    directory_path = args.directory

    loader = PyPDFDirectoryLoader(directory_path)
    print('loading docs')
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(chunk_size=400,chunk_overlap=200)
    print('splitting docs')
    splits = splitter.split_documents(docs)

    embedAgent = OllamaEmbeddings(model='llama2',show_progress=True)
    print('generating embeddings')

    vectStore = Chroma.from_documents(documents=splits, embedding=embedAgent, persist_directory=directory_path)

def testOllamaSpeed(args):
    # Get the directory path from arguments
    directory_path = args.directory

    loader = PyPDFDirectoryLoader(directory_path)
    print('loading docs')
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(chunk_size=1000,chunk_overlap=200)
    print('splitting docs')
    splits = splitter.split_documents(docs)

    txts = []

    print('making txt')
    for doc in tqdm.tqdm(docs):
        txts.append(str(doc))

    print('making embeddings')
    mbeds = []
    for txt in tqdm.tqdm(txts):
       mbeds.append(ollama.embeddings(model='llama2',prompt=txt))

if __name__ == '__main__':

    # Create the argument parser
    parser = argparse.ArgumentParser(description="Script to process a directory path")

    # Add the -d argument for directory path
    parser.add_argument('-d', '--directory', type=str, required=True, help='Path to the directory')

    # Parse the arguments
    args = parser.parse_args()

    #main(args)
    testOllamaSpeed(args)

Error Message and Stack Trace (if applicable)

n/a

Description

Calls to the Ollama embeddings API are very slow (1000 to 2000 ms per call) and GPU utilization is very low, spiking to 30-100% only once every second or two. This happens whether I run main() or testOllamaSpeed() in the example code, which would suggest the problem is with Ollama itself. However, if I run the following code, which does not use any langchain imports, each call completes in 200-300 ms and GPU utilization hovers at a consistent 70-80%. The gap is even more pronounced with mxbai-embed-large: the example code takes 1000 to 2000 ms per call, while the code below takes ~50 ms per call. VRAM usage never exceeds roughly 4 GB (~25% of my total VRAM).

For reference, my environment is: Windows 11, 12th Gen i9-1250HX, 128 GB RAM, NVIDIA RTX A4500 Laptop GPU (16 GB VRAM), Ollama 0.1.38.

import ollama
import os
import PyPDF2
import tqdm
import argparse

def read_pdfs_from_directory(directory_path):
    pdf_texts = {}

    for filename in os.listdir(directory_path):
        if filename.endswith('.pdf'):
            file_path = os.path.join(directory_path, filename)
            pdf_texts[filename] = read_pdf(file_path)

    return pdf_texts

def read_pdf(file_path):
    pdf_text = ""

    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            pdf_text += page.extract_text()

    return pdf_text

def split_into_chunks(input_string, chunk_size):
    # Use list comprehension to create chunks of the specified size
    chunks = [input_string[i:i+chunk_size] for i in range(0, len(input_string), chunk_size)]
    return chunks

def main(args):

    directory_path = args.directory

    print('Reading pdfs')
    allFiles = read_pdfs_from_directory(directory_path)

    print('chunking')
    chunks = []
    for k,v in allFiles.items():
        chunks.extend(split_into_chunks(v,1000))

    print('Generating embeddings')
    for chunk in tqdm.tqdm(chunks):
        ollama.embeddings(model='llama2',prompt=chunk)
        #ollama.embeddings(model='mxbai-embed-large',prompt=chunk)
    print('done')

if __name__ == '__main__':

    # Create the argument parser
    parser = argparse.ArgumentParser(description="Script to process a directory path")

    # Add the -d argument for directory path
    parser.add_argument('-d', '--directory', type=str, required=True, help='Path to the directory')

    # Parse the arguments
    args = parser.parse_args()

    main(args)
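
For what it's worth, a minimal sketch of how the two paths can be timed side by side on a single chunk (the sample text, model name, and repeat count are placeholders, and it assumes the langchain-community import path and a local Ollama server):

import time

import ollama
from langchain_community.embeddings import OllamaEmbeddings

sample = "some representative chunk of text " * 20  # placeholder text

def time_call(label, fn, repeats=10):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    per_call = (time.perf_counter() - start) / repeats
    print(f'{label}: {per_call * 1000:.0f} ms per call')

embedder = OllamaEmbeddings(model='llama2')

# langchain path (the same embedding class main() uses via Chroma.from_documents)
time_call('langchain OllamaEmbeddings', lambda: embedder.embed_query(sample))

# direct client path (what the non-langchain script above uses)
time_call('ollama.embeddings', lambda: ollama.embeddings(model='llama2', prompt=sample))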

System Info

langchain==0.2.0
langchain-chroma==0.1.1
langchain-community==0.2.0
langchain-core==0.2.0
langchain-text-splitters==0.2.0

keenborder786 commented 1 month ago

Yes, I faced a similar situation, since Ollama does not support concurrent requests. To work around this, I started multiple Ollama containers and distributed the embedding requests across them in a round-robin manner.
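
For anyone who wants to try the same approach, here is a minimal sketch of round-robin dispatch across several Ollama servers. The host URLs, ports, model name, and the embed_round_robin helper are illustrative assumptions, not anything from this thread; it presumes the containers are already running:

import ollama
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoints, one per Ollama container started beforehand, e.g.:
#   docker run -d -p 11434:11434 ollama/ollama
#   docker run -d -p 11435:11434 ollama/ollama
HOSTS = ['http://localhost:11434', 'http://localhost:11435']
CLIENTS = [ollama.Client(host=h) for h in HOSTS]

def embed_round_robin(texts, model='llama2'):
    def embed_one(pair):
        idx, text = pair
        client = CLIENTS[idx % len(CLIENTS)]  # round-robin by position in the list
        return client.embeddings(model=model, prompt=text)['embedding']

    # One worker per server, so each container handles one request at a time.
    with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
        return list(pool.map(embed_one, enumerate(texts)))

Something like embeddings = embed_round_robin(chunks) could replace the per-chunk loop in the non-langchain script above. Whether it helps depends on having enough VRAM to keep one copy of the model loaded per container.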

asalaerekat commented 3 weeks ago

I'm having the same issue, ollama took more than 20 hours to generate embeddings using 'nomic-embed-text' on 190K texts. now I want to generate embeddings using llama3 on the same texts, but I'm worried it will take forever! Can we run it on a GPU or run it in batches/parallel processing, or any other idea to make the run faster? Did you resolve this issue or come up with an idea to resolve it?