deific opened this issue 3 days ago
The LLM node in Dify version 0.8.3 may remain in the "Running" state for a long time and fail to terminate for several reasons:

1. Context Fetching: The _fetch_context method is a generator that yields RunRetrieverResourceEvent events. If context fetching is slow, or if there is an issue with the context data, the node can remain in the running state.
2. Model Invocation: The _invoke_llm method invokes the large language model and handles the result through a generator. If the invocation takes a long time, or if the streaming of results stalls, the node can remain in the running state (see the minimal sketch after this list).
3. Event Handling: The _handle_invoke_result method processes the results of the model invocation. If the streamed chunks or the final result are delayed or mishandled, the node can remain in the running state.
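To see why a stalled stream freezes the node, consider this minimal, self-contained sketch (hypothetical code, not from Dify): a consumer loop over a generator can only finish when the generator is exhausted or raises, so a producer that goes silent leaves the loop, and therefore the node, "Running" forever. Calling run_node(queue.Queue()) below hangs by design.

import queue

def stalled_stream(q: queue.Queue):
    # Simulates a streaming LLM response whose upstream stops producing.
    while True:
        yield q.get()  # blocks forever if nothing is ever enqueued

def run_node(q: queue.Queue):
    # The node's event loop: it can only terminate when the generator
    # is exhausted or raises, so a silent upstream hangs it here.
    for chunk in stalled_stream(q):
        print(chunk)
    print("node finished")  # never reached without a producer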
Here is the relevant code that shows these processes:
def _fetch_context(self, node_data: LLMNodeData, variable_pool: VariablePool) -> Generator[RunEvent, None, None]:
    if not node_data.context.enabled:
        return

    if not node_data.context.variable_selector:
        return

    context_value = variable_pool.get_any(node_data.context.variable_selector)
    if context_value:
        if isinstance(context_value, str):
            yield RunRetrieverResourceEvent(retriever_resources=[], context=context_value)
        elif isinstance(context_value, list):
            context_str = ""
            original_retriever_resource = []
            for item in context_value:
                if isinstance(item, str):
                    context_str += item + "\n"
                else:
                    if "content" not in item:
                        raise ValueError(f"Invalid context structure: {item}")

                    context_str += item["content"] + "\n"

                    retriever_resource = self._convert_to_original_retriever_resource(item)
                    if retriever_resource:
                        original_retriever_resource.append(retriever_resource)

            yield RunRetrieverResourceEvent(
                retriever_resources=original_retriever_resource, context=context_str.strip()
            )
def _invoke_llm(
    self,
    node_data_model: ModelConfig,
    model_instance: ModelInstance,
    prompt_messages: list[PromptMessage],
    stop: Optional[list[str]] = None,
) -> Generator[RunEvent | ModelInvokeCompleted, None, None]:
    # release the DB session before the potentially long-running model call
    db.session.close()

    invoke_result = model_instance.invoke_llm(
        prompt_messages=prompt_messages,
        model_parameters=node_data_model.completion_params,
        stop=stop,
        stream=True,
        user=self.user_id,
    )

    generator = self._handle_invoke_result(invoke_result=invoke_result)

    usage = LLMUsage.empty_usage()
    # the node cannot finish until this loop drains the stream
    for event in generator:
        yield event
        if isinstance(event, ModelInvokeCompleted):
            usage = event.usage

    self.deduct_llm_quota(tenant_id=self.tenant_id, model_instance=model_instance, usage=usage)
def _handle_invoke_result(
    self, invoke_result: LLMResult | Generator
) -> Generator[RunEvent | ModelInvokeCompleted, None, None]:
    # a non-streaming LLMResult ends the generator immediately
    if isinstance(invoke_result, LLMResult):
        return

    model = None
    prompt_messages: list[PromptMessage] = []
    full_text = ""
    usage = None
    finish_reason = None
    # blocks while waiting on each streamed chunk; a silent stream stalls here
    for result in invoke_result:
        text = result.delta.message.content
        full_text += text

        yield RunStreamChunkEvent(chunk_content=text, from_variable_selector=[self.node_id, "text"])

        if not model:
            model = result.model

        if not prompt_messages:
            prompt_messages = result.prompt_messages

        if not usage and result.delta.usage:
            usage = result.delta.usage

        if not finish_reason and result.delta.finish_reason:
            finish_reason = result.delta.finish_reason

    if not usage:
        usage = LLMUsage.empty_usage()

    yield ModelInvokeCompleted(text=full_text, usage=usage, finish_reason=finish_reason)
All three methods rely on generators and event handling, so a stall at any point in that chain can leave the node stuck in the "Running" state [1].
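If you are self-hosting and want such a stall to surface as an error instead of an indefinite "Running" state, one option is to wrap the streamed iteration with a per-chunk timeout. This is a rough sketch under assumptions, not part of Dify's API; iter_with_timeout is a hypothetical helper name:

import threading
from typing import Iterator, TypeVar

T = TypeVar("T")
_SENTINEL = object()

def iter_with_timeout(source: Iterator[T], timeout: float):
    """Yield items from source; raise TimeoutError if any single item
    takes longer than `timeout` seconds to arrive (hypothetical helper)."""
    box: list = []

    def fetch():
        try:
            box.append(next(source, _SENTINEL))
        except Exception as exc:  # surface upstream errors to the consumer
            box.append(exc)

    while True:
        box.clear()
        worker = threading.Thread(target=fetch, daemon=True)
        worker.start()
        worker.join(timeout)
        if not box:
            # the producer went silent: fail fast instead of hanging
            raise TimeoutError(f"no chunk received within {timeout}s")
        item = box[0]
        if item is _SENTINEL:
            return  # stream finished normally
        if isinstance(item, Exception):
            raise item
        yield item

The consuming loop would then become, for example, for result in iter_with_timeout(invoke_result, timeout=60.0): .... Note the caveat that an abandoned worker thread stays blocked inside next(source); a thread-based timeout can abandon, but not cancel, a blocked read, yet it does let the node fail fast and report an error.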
Self Checks
Dify version
0.8.3
Cloud or Self Hosted
Cloud, Self Hosted (Docker)
Steps to reproduce
The LLM node has been in the Running state for a long time and cannot be terminated. Sometimes the model does not respond and the node remains in the running state for a long time, with no exceptions or errors in the backend logs. At other times the LLM node has already returned a response, but the node status is still Running.
✔️ Expected Behavior
The LLM node either finishes normally or reports an error.
❌ Actual Behavior
The LLM node remains in the Running state for a long time.