Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
This is a tricky issue to reproduce because it only happens on certain systems and depends heavily on how the Python process is started.
In the example provided, I measure the time between one node finishing and the next node starting. On my local machine, and inside a Databricks notebook, this is between 5 and 50 milliseconds, but when I run it directly on our Databricks server, it is between 400 and 700 milliseconds. Within a single run the gap is very consistent, but it differs between runs.
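As a rough illustration of this measurement (not the original code), the gap between one node finishing and the next starting can be recorded like this, with plain Python functions standing in for graph nodes:

```python
import time


def measure_gaps(nodes, payload):
    """Run nodes sequentially, recording the gap (in ms) between one node
    finishing and the next starting -- the overhead under investigation."""
    gaps = []
    prev_end = None
    for node in nodes:
        start = time.perf_counter()
        if prev_end is not None:
            gaps.append((start - prev_end) * 1000.0)
        payload = node(payload)
        prev_end = time.perf_counter()
    return gaps


# Three trivial stand-in nodes; with no framework in between,
# the measured gaps are effectively zero.
identity = lambda x: x
gaps = measure_gaps([identity, identity, identity], {})
```

With a framework such as LangGraph in between, the equivalent gaps are what grow from tens to hundreds of milliseconds in the problematic environment.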
This delay occurs between every pair of nodes, so for a flow with a useful number of nodes it adds up to a delay of around 10 seconds, which is prohibitively slow in a chat application.
I am not able to root-cause the issue because I do not understand what LangGraph is doing between nodes, especially in a flow as simple as the one in this example. I do see a lot of asynchronous functions, so I suspect some kind of race condition or wait that takes this long to resolve.
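One way to narrow this down (a suggestion, not something from the original report) is to profile the invocation with the standard-library profiler and look at where the time actually goes; `run_graph` below is a hypothetical stand-in for the real `graph.invoke(...)` call:

```python
import cProfile
import io
import pstats


def run_graph():
    # Stand-in for the real call, e.g. graph.invoke(inputs);
    # replace with the actual LangGraph invocation when profiling.
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
run_graph()
profiler.disable()

# Print the top functions by cumulative time; on the slow system, the
# inter-node overhead should show up here (e.g. in asyncio internals).
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```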
All timing experiments were done on a dedicated cloud compute cluster with no other processes running.
System Info
Databricks Standard_DS3_v2 compute cluster (14 GB memory, 4 cores)
Databricks runtime version: 14.3 LTS ML (Apache Spark 3.5.0, Scala 2.12)
OS: Linux #76-20-01-1-Ubuntu SMP
Python: 3.10.12
langchain_core: 0.2.28
langchain: 0.0.348
langsmith: 0.1.98
langgraph: 0.2.0