langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Random Application Lockdown on ECS Fargate with Langchain and FastAPI #13750

Closed Santiagowwi closed 5 months ago

Santiagowwi commented 11 months ago

System Info

Operating system/architecture: Linux/X86_64 CPU | Memory 8 vCPU | 16 GB Platform version 1.4.0 Launch type FARGATE

Project libraries:

snowflake-sqlalchemy==1.4.6
python-dotenv==0.21.0
openai==0.27.2
langchain==0.0.336
pandas==2.0.2
boto3==1.26.144
colorama==0.4.6
fastapi==0.100.1
pydantic~=1.10.8
pytest~=7.1.2
uvicorn~=0.17.6
cassio==0.1.3
sentry-sdk==1.29.2
langsmith==0.0.66
numpy==1.24.3
SQLAlchemy==1.4.46
psycopg2-binary==2.9.7
tiktoken==0.4.0
httpx==0.24.1
unidecode==1.3.7
transformers==4.28.0
transformers[torch]
tensorflow==2.12.1
keras==2.12.0

Python version of the project: python:3.10-slim-bullseye

Who can help?

No response


Reproduction

The error is hard to reproduce because it appears to be random. After a few requests, FastAPI stops responding following the OPTIONS request to the endpoint. The issue seems to be caused by one of the libraries in use. I first observed it after refactoring the project from legacy chains to chains with LCEL; since that refactoring, the ECS service has behaved strangely. We have debugged extensively throughout the codebase, yet there is no indication of where the error originates. Everything works flawlessly in local emulation, with no unusual errors. The problem only arises when the code is deployed to the ECS Fargate instance, and I want to emphasize that it did not exist before the changes described above.

[Screenshot attached, 2023-11-22]

Expected behavior

I need someone to help me with new ways to debug this extremely rare bug, to give me ideas on what to do, what to show from my machine, what can be done, or if it's some incompatibility between the libraries. I haven't been able to pinpoint the specific point where the program stops, and it's proving to be very challenging.

dosubot[bot] commented 11 months ago

🤖

Hello @Santiagowwi,

I'm Dosu, a bot designed to assist with issues related to the LangChain repository. I'm here to help you solve bugs, answer your questions, and guide you on your path to becoming a contributor. While we're waiting for a human maintainer, I'm here to assist you. Let's tackle this issue together!

I understand that you're experiencing a challenging issue with your application running on ECS Fargate after refactoring from legacy chains to chains with LCEL in LangChain. Here are a few suggestions on how you might approach debugging this issue:

  1. Check for Compatibility Issues: You're using langchain==0.0.336, but the latest version is v0.0.339rc1. It's possible that there might be some compatibility issues with the older version of LangChain and the other libraries you're using. Try updating LangChain to the latest version and see if the issue persists.

  2. Logging: If you haven't already, add extensive logging to your application. This can help you identify the last successful operation before the application stops responding. Python's built-in logging module can be used for this purpose.

  3. Debugging in the ECS Environment: Since the issue only occurs in the ECS Fargate environment, it might be helpful to debug in that environment. AWS provides a guide on how to debug applications in the ECS environment here.

  4. Isolate the Issue: Try to isolate the issue by gradually removing or commenting out parts of your code, especially the parts that were changed during the refactoring. This might help you identify the specific part of your code that's causing the issue.

  5. Check for Resource Issues: Check if your application is running out of resources. ECS Fargate provides metrics that you can monitor to check if your application is running out of CPU or memory. You can view these metrics in the AWS Management Console.

  6. Check for Network Issues: Since the application stops responding after the OPTIONS query, it might be a network-related issue. Check if there are any network policies or security groups that might be blocking the network traffic to your application.

  7. Check for Concurrency Issues: Since LCEL chains support asynchronous operations, there might be some concurrency issues that are causing the application to stop responding. Try to identify any potential race conditions or deadlocks in your code.

I hope these suggestions help you in debugging the issue. If you have any more information or if there's anything else you'd like to know, please feel free to ask.
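Suggestion 2 (logging) can be sketched with nothing but the standard library: wrapping each pipeline step in a decorator that logs entry, exit, and duration means the last ENTER without a matching EXIT in the logs pinpoints the step that hung. The step names and functions below are hypothetical placeholders, not LangChain APIs.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
log = logging.getLogger("chain-trace")

def traced(step_name):
    """Log entry, exit, and duration of a step; an ENTER with no EXIT
    in the logs identifies exactly where the application stalled."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log.info("ENTER %s", step_name)
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info("EXIT %s after %.3fs", step_name, time.monotonic() - start)
        return wrapper
    return decorator

@traced("retrieve")
def retrieve(query):           # hypothetical pipeline step
    return f"docs for {query}"

@traced("generate")
def generate(docs):            # hypothetical pipeline step
    return f"answer from {docs}"

print(generate(retrieve("hello")))   # → answer from docs for hello
```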

Sources

Santiagowwi commented 11 months ago
  1. We meticulously implemented logging throughout the codebase, tracing each step of the program, but the logs haven't revealed any discernible fault. That's precisely why I've raised the issue here for additional insights.

  2. The same applies to our debugging efforts within the ECS environment. The log output consistently shows the same behavior, with no anomalous patterns. It's perplexing because the program halts sporadically, seemingly at random points during user interactions.

  3. For this point, I plan to isolate the issue by building a minimal application that reproduces the conditions leading to the error. Once I've replicated it, I'll share the distilled version for further analysis.

  4. AWS Fargate's resource utilization appears to be well within normal bounds.

  5. Connectivity issues seem unlikely: since the application lives inside AWS infrastructure, network-related problems are less probable.

  6. Notably, the endpoint in question isn't asynchronous; we simply call invoke.

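One mechanism worth ruling out here: in FastAPI, a plain `def` endpoint runs in a worker thread, but if a blocking call such as a synchronous `invoke` ends up inside an `async def` endpoint (or any coroutine), it freezes the whole event loop and every other request with it. A minimal stdlib demonstration of that effect, with `time.sleep` standing in for the blocking chain call:

```python
import asyncio
import time

async def heartbeat(ticks):
    # Stands in for the server handling other requests on the same event loop.
    for _ in range(5):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())

async def blocking_handler():
    time.sleep(0.3)          # e.g. a synchronous chain invocation in a coroutine

async def offloaded_handler():
    await asyncio.to_thread(time.sleep, 0.3)   # same work, in a worker thread

async def measure(handler):
    ticks = []
    start = time.monotonic()
    await asyncio.gather(heartbeat(ticks), handler())
    # Largest gap between heartbeats: large when the loop was blocked.
    gaps = [b - a for a, b in zip([start] + ticks, ticks)]
    return max(gaps)

blocked = asyncio.run(measure(blocking_handler))
offloaded = asyncio.run(measure(offloaded_handler))
print(f"max heartbeat gap, blocking: {blocked:.2f}s, offloaded: {offloaded:.2f}s")
```

The blocking variant stalls the heartbeat for the full 0.3 s; the offloaded one keeps it ticking every 0.05 s. If any request in the real service blocks the loop and never returns, the whole application appears locked, matching the symptom described above.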
radiachkik commented 9 months ago

Have you ever found a solution to this problem?

I am experiencing the same issue and currently suspect the ThreadPoolExecutor used by both FastAPI / Starlette and the async LangChain methods. Upgrading from Python 3.10 to 3.12 (where some ThreadPoolExecutor deadlocks were fixed) did not solve the issue either.

Even though the sync batch methods also use a ThreadPoolExecutor, they do not use the default one used e.g. by the async invoke (at least not in 0.1.4), so I am not quite sure how this could be related; I am not an expert in Python threading, though.
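Whether LangChain's async code paths shared the default executor in these versions is this comment's hypothesis, not something verified here, but the contention mechanism itself is easy to demonstrate with the standard library: everything scheduled via `loop.run_in_executor(None, ...)` funnels through one shared default `ThreadPoolExecutor`, so a starved default pool serializes work that a dedicated pool would run in parallel.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def slow_io():
    time.sleep(0.2)   # stands in for a blocking LLM or cache call

async def run_jobs(executor):
    loop = asyncio.get_running_loop()
    # Starve the default pool down to one worker to make contention visible.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=1))
    start = time.monotonic()
    await asyncio.gather(
        loop.run_in_executor(executor, slow_io),   # None = shared default pool
        loop.run_in_executor(executor, slow_io),
    )
    return time.monotonic() - start

shared = asyncio.run(run_jobs(None))                      # default executor
dedicated = asyncio.run(run_jobs(ThreadPoolExecutor(4)))  # private pool
print(f"shared: {shared:.2f}s, dedicated: {dedicated:.2f}s")
```

With the shared one-worker default pool the two jobs run back to back (~0.4 s); with a dedicated pool they overlap (~0.2 s). Anything that exhausts or blocks the default pool can therefore stall unrelated code that also relies on it.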

I would appreciate any hints or solutions. :)

Santiagowwi commented 8 months ago

@radiachkik We were unable to find the root cause of the problem; it seemed to be OpenAI timeouts, but in the end we couldn't confirm it. It's very hard to debug.

In the end, instead of running the entire application on ECS, we kept a backend on ECS and moved everything related to chains and LLMs onto Lambda infrastructure. That was the only way to make the whole app work.

radiachkik commented 8 months ago

Hey @Santiagowwi . Thank you for your response!

In our case, the problem was caused by us not setting a timeout for the Redis client we used as the cache for LLM calls and the embeddings. After setting a timeout and switching to a more robust Redis instance, the problem seems to be solved.

My initial guess was therefore not accurate: the issue was not caused by the threading mechanism. Using the async methods, which in turn use those threads, just made the problem harder to debug.
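The general lesson here, always setting a client-side timeout on anything that does network I/O, can be shown with plain sockets (redis-py exposes the same idea through its `socket_timeout` and `socket_connect_timeout` arguments; exact behavior may vary by version). A listener that never replies stands in for the unresponsive Redis instance:

```python
import socket

# A listener that never replies, standing in for an unresponsive Redis.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen()

client = socket.create_connection(silent.getsockname(), timeout=0.2)
try:
    client.recv(1024)            # would block indefinitely without the timeout
    outcome = "reply"
except socket.timeout:
    outcome = "timed out"        # caller regains control instead of hanging
finally:
    client.close()
    silent.close()
print(outcome)                   # → timed out
```

Without the timeout, that `recv` hangs forever inside the request handler, which from the outside looks exactly like the "random lockdown" described in this issue.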