truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

Too many open files, python process crash while recording with TruChain #849

Closed. epinzur closed this issue 2 months ago

epinzur commented 5 months ago

When recording using TruChain, my Python process consistently crashes after 2426 records have been saved to the database, with the error: "Too many open files". I've also seen via top that the process reaches a virtual memory size of 200+ GB before the crash. I've hit this crash 5+ times in the past few days.
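For background, this error means the process has hit the kernel's per-process file-descriptor limit (RLIMIT_NOFILE). A minimal sketch, unrelated to TruLens, that reproduces the same OSError on Linux by lowering the soft limit and leaking handles (the value 64 is an arbitrary choice for illustration):

```python
import errno
import resource
import tempfile

# Lower the soft open-file limit so the failure is easy to trigger.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

handles = []
try:
    while True:
        # Each temporary file consumes one descriptor; never closing
        # them simulates a descriptor leak.
        handles.append(tempfile.TemporaryFile())
except OSError as exc:
    # errno.EMFILE is the "Too many open files" error.
    print(exc.errno == errno.EMFILE)
finally:
    for h in handles:
        h.close()
    # Restore the original limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

Any code path that opens files, sockets, or database connections without closing them will eventually fail the same way.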

I get a HUGE stack trace after the crash occurs. See: pdf_splits.log

I'm using TruLens-Eval 0.20.3, currently via this branch from Piotr: https://github.com/truera/trulens/tree/piotrm/deferred_mem, but I've also seen the crash on a default install of 0.20.3. I haven't tried the latest release yet.

I'm recording in deferred mode.

This is my script:

import tru_shared

from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
import glob, os

os.environ["ASTRA_DB_ENDPOINT"] = os.environ.get("ASTRA_DB_ENDPOINT_PDF_SPLITS_2")
os.environ["ASTRA_DB_TOKEN"] = os.environ.get("ASTRA_DB_TOKEN_PDF_SPLITS_2")

framework = tru_shared.Framework.LANG_CHAIN

chatModel = tru_shared.get_azure_chat_model(framework, "gpt-35-turbo", "0613")
embeddings = tru_shared.get_azure_embeddings_model(framework)

# Collect the unique dataset names from data/<dataset>/source_files/*.pdf paths.
pdf_datasets = []
for file_path in glob.glob('data/*/source_files/*.pdf'):
    dataset = file_path.split("/")[1]
    if dataset not in pdf_datasets:
        pdf_datasets.append(dataset)

collection_names = ["PyPDFium2Loader", "PyMuPDFLoader", "PyPDFLoader", "PDFMinerLoader_by_page", "PDFMinerLoader_by_pdf"]

prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say: "I don't know".
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

# Build and run one RAG pipeline per vector-store collection.
for collection_name in collection_names:
    vstore = tru_shared.get_astra_vector_store(framework, collection_name)
    pipeline = (
        {"context": vstore.as_retriever(), "question": RunnablePassthrough()}
        | prompt
        | chatModel
        | StrOutputParser()
    )

    tru_shared.execute_experiment(framework, pipeline, collection_name, pdf_datasets)

This script makes heavy use of my helper file: tru_shared

I get through almost 2 of the 5 collection_names before the crash. I can unblock myself by running a single collection at a time.
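As a stopgap while debugging (not part of the original script), the process's soft descriptor limit can be raised to its hard limit at startup. This is a sketch assuming a Linux host; it only buys headroom and delays a genuine leak rather than fixing it:

```python
import resource

# Inspect the current per-process open-file limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit to the hard limit. An unprivileged process may
# do this freely as long as the hard limit itself is not increased.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

With a leak of this size (virtual memory in the hundreds of GB), the crash would still arrive eventually, just after more records.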

piotrm0 commented 3 months ago

Hi; this is a tough one to debug. If you see it happen next time, can you collect information for us about the open handles in the process TruLens is running in? That is, the output of:

lsof -p 5998

assuming the TruLens process has PID 5998. You can find the process id with:

ps ax | grep python

There might be many other Python processes running, depending on what you are doing, so you will have to pick out the one you think is causing the problem.
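As a complement to lsof, the descriptor count can also be watched from inside the process, for example by logging it after each record. A minimal sketch, Linux-only, reading /proc rather than anything TruLens provides:

```python
import os

def open_fd_count() -> int:
    """Return the number of file descriptors this process has open (Linux)."""
    return len(os.listdir("/proc/self/fd"))

# Opening a file raises the count by exactly one while it stays open.
before = open_fd_count()
with open("/tmp/fd_probe.txt", "w") as probe:
    probe.write("probe")
    print(open_fd_count() - before)  # prints 1 while the file is open
os.remove("/tmp/fd_probe.txt")
```

If this count climbs steadily with each saved record, the paths listed in /proc/self/fd (via `ls -l`) point at what is being leaked.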

dosubot[bot] commented 2 months ago

Hi, @epinzur

I'm helping the trulens team manage their backlog and am marking this issue as stale. From what I understand, the issue involves a Python process crashing with a "Too many open files" error after saving 2426 records to the database while recording with TruChain. piotrm0 has suggested collecting information about the open handles in the process TruLens is running in to aid in debugging. However, the issue is currently unresolved.

Could you please confirm if this issue is still relevant to the latest version of the trulens repository? If it is, please let the trulens team know by commenting on the issue. Otherwise, feel free to close the issue yourself or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation.

yuvneshtruera commented 2 months ago

Closing this for now; please let us know if this happens again and we'll re-open it.