langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

CUDA memory won't release with HuggingFaceEmbeddings + local embedding model #17678

Closed. 1100111GTH closed this issue 8 months ago.

1100111GTH commented 8 months ago

Example Code

```python
local_embedding = HuggingFaceEmbeddings(model_name=embedding_path)
local_vdb = FAISS.load_local(vector_path, local_embedding, "default")
```

Error Message and Stack Trace (if applicable)

No response

Description

[Screenshot: CleanShot 2024-02-18 at 00 47 51]

I find that when I retrieve something from FAISS, GPU memory usage goes up; that's normal. But when the work is done, i.e., the retrieval has already finished, the GPU memory does not fall back down, even after I close the Gradio web interface. That is the problem. I'm building a Gradio web app for my company that many people will use. Each person who retrieves from FAISS makes the embedding model claim additional memory: if one embedding call uses 2 GiB (say), then two concurrent calls cost 4 GiB and three cost 6 GiB, rather than staying around 2 to 2.3 GiB. It costs too many resources. So how can I manually stop the embedding model when the work is done and release the GPU memory? That's very important to me, thanks for your help. 🌹
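For illustration, here is a minimal sketch of the kind of manual release being asked about, assuming the underlying sentence-transformers model is reachable as `local_embedding.client` (where `HuggingFaceEmbeddings` keeps it); the helper name is hypothetical:

```python
import gc

import torch


def release_embedding_model(embedding) -> None:
    # Hypothetical helper: free the GPU memory held by a HuggingFaceEmbeddings
    # instance once retrieval work is finished.
    embedding.client.to("cpu")  # move the sentence-transformers model off the GPU
    embedding.client = None     # drop the reference so the weights can be collected
    gc.collect()                # collect the now-unreferenced module
    torch.cuda.empty_cache()    # return cached CUDA blocks to the driver
```

This only helps if nothing else still references the model; a new `HuggingFaceEmbeddings` would have to be constructed before the next retrieval.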

System Info

Python 3.9.18, langchain (latest), Ubuntu 20.04 LTS

dosubot[bot] commented 8 months ago

🤖

Thank you for providing a detailed description of your issue. It seems like the GPU memory is not being released after the embeddings are used. This could be due to a variety of reasons, and I'll need a bit more information to help you troubleshoot this.

Firstly, could you please provide the version of the HuggingFace library you're using? Also, it would be helpful to know the specific model you're using for embeddings.

Secondly, could you share the code snippet where you're calling the FAISS and HuggingFaceEmbeddings? This will help me understand the context in which these embeddings are being used and released.

Lastly, are you using any other libraries or frameworks that might be interacting with your GPU? Sometimes, other libraries can hold onto GPU memory and prevent it from being released.

Once I have this information, I'll be better equipped to help you resolve this issue.
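As a quick diagnostic (a sketch, not part of Dosu's reply): PyTorch's caching allocator keeps freed blocks reserved for reuse, so `nvidia-smi` can report memory as used even after the tensors holding it are gone. Comparing allocated versus reserved memory shows whether that cache is what looks like a leak:

```python
import torch

# Memory held by live tensors vs. memory the caching allocator has reserved.
# A large gap means PyTorch is caching freed blocks; torch.cuda.empty_cache()
# returns them to the driver, which is the number nvidia-smi reports.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```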

1100111GTH commented 8 months ago
```python
retriever = local_vdb.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 20, "score_threshold": 0.3}
)
retriever_compress = ContextualCompressionRetriever(
    base_compressor=BgeRerank(),
    base_retriever=retriever
)
```

BgeRerank() is based on langchain.retrievers.document_compressors.cohere_rerank.py; it uses another reranker model loaded from local disk, and its memory management behaves the same way. Those two models cause me a lot of pain 😧. If I move them to the CPU the situation might be better, but I'm afraid of overloading the CPU, because the system I'm trying to build may receive 200 calls at the same time.
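An aside on the per-user growth described above (a sketch, not from the thread): if each request constructs its own `HuggingFaceEmbeddings`, every request loads a fresh copy of the weights. Caching one instance per process keeps N concurrent users at a single model's worth of GPU memory; the import path may differ by langchain version:

```python
from functools import lru_cache

from langchain_community.embeddings import HuggingFaceEmbeddings  # path varies by version


@lru_cache(maxsize=1)
def get_embeddings(model_path: str) -> HuggingFaceEmbeddings:
    # Loaded once per process; later calls with the same path reuse the instance,
    # so concurrent requests share one copy of the weights on the GPU.
    return HuggingFaceEmbeddings(model_name=model_path)
```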

1100111GTH commented 8 months ago

The models I use are bge-large-zh-v1.5 (embedding) and bge-reranker-large (reranker).

1100111GTH commented 8 months ago
```python
if read_persist_var("chat_mode_choose") == chatmode[1]:
    database_answer = RunnableWithMessageHistory(
        create_retrieval_chain(retriever_compress, create_stuff_documents_chain(llm, prompt_1)),
        RedisChatMessageHistory,
        input_messages_key="input",
        history_messages_key="history",
        output_messages_key="answer"
    )
```
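An aside on this snippet: `RunnableWithMessageHistory` expects a callable that maps a session id to a chat history. Passing the `RedisChatMessageHistory` class works because the class itself is such a callable, but an explicit factory makes the Redis connection visible (a sketch; the URL is assumed, not taken from the thread):

```python
from langchain_community.chat_message_histories import RedisChatMessageHistory


def get_session_history(session_id: str) -> RedisChatMessageHistory:
    # Assumed local Redis URL, shown for illustration only.
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379/0")
```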
1100111GTH commented 8 months ago

sentence-transformers==2.2.2

1100111GTH commented 8 months ago
```python
# Self file
import os
import sys
current_dir = os.path.dirname(os.path.abspath(__file__))  # absolute path of the current file (only works when run as a script)
parent_dir = os.path.dirname(current_dir)  # path of the parent directory
sys.path.append(parent_dir)  # add the parent directory to sys.path
from config.config import read_persist_var
from config.config import reranker_path
# Basic
from typing import Optional, Sequence
# Langchain
from langchain.schema import Document
from langchain.pydantic_v1 import Extra, root_validator
from langchain.callbacks.manager import Callbacks
from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from sentence_transformers import CrossEncoder


class BgeRerank(BaseDocumentCompressor):
    model_name: str = reranker_path
    top_n: int = 3
    """Number of documents to return."""
    model: CrossEncoder = CrossEncoder(model_name)
    """CrossEncoder instance to use for reranking."""

    def bge_rerank(self, query, docs):
        model_inputs = [[query, doc] for doc in docs]
        scores = self.model.predict(model_inputs)
        results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return results[:self.top_n]

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid
        arbitrary_types_allowed = True

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None,
    ) -> Sequence[Document]:
        """
        Compress documents using BAAI/bge-reranker models.

        Args:
            documents: A sequence of documents to compress.
            query: The query to use for compressing the documents.
            callbacks: Callbacks to run during the compression process.

        Returns:
            A sequence of compressed documents.
        """
        if len(documents) == 0:  # to avoid empty api call
            return []
        doc_list = list(documents)
        _docs = [d.page_content for d in doc_list]
        results = self.bge_rerank(query, _docs)
        final_results = []
        for r in results:
            doc = doc_list[r[0]]
            doc.metadata["relevance_score"] = r[1]
            final_results.append(doc)
        return final_results
```

This is the reranker.
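One observation on the class above (an editor's aside, not from the thread): `model: CrossEncoder = CrossEncoder(model_name)` executes when the class body runs, so the reranker's weights are loaded as soon as the module is imported, whether or not a request ever uses them. A lazy wrapper (a sketch, not a drop-in replacement for the pydantic field) defers that cost to the first call:

```python
from sentence_transformers import CrossEncoder


class LazyCrossEncoder:
    """Loads the reranker on first use instead of at import time."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self._model = None  # weights are not loaded until first use

    def predict(self, inputs):
        if self._model is None:
            self._model = CrossEncoder(self.model_name)  # loaded on demand
        return self._model.predict(inputs)
```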