langflow-ai / langflow

Langflow is a low-code app builder for RAG and multi-agent AI applications. It’s Python-based and agnostic to any model, API, or database.
http://www.langflow.org
MIT License

Running LLM Data Iterator lead to WORKER TIMEOUT and sent code 134! #3466

Closed DDT-scrip closed 3 weeks ago

DDT-scrip commented 2 months ago

Bug Description

While iterating through a CSV file using the NamastexLabs LLM Data Iterator, once the iteration count reaches a certain threshold after a few successful iterations, errors suddenly appear:

CRITICAL 2024-08-21 11:28:46 - CRITICAL - glogging - WORKER TIMEOUT (pid:50230) glogging.py:269
ERROR 2024-08-21 11:28:46 - ERROR - glogging - Worker (pid:50230) was sent code 134! glogging.py:272

while the UI shows the error message: Error Building Component: Component Ollama has not been built yet

or: Error Building Component: Component CSV Loader has not been built yet

Increasing the langflow run --timeout setting didn't solve the problem.

Reproduction

LLM Iteration error recreation.json

  1. Upload a CSV with over 20 rows and a long paragraph in each row.
  2. Set the Iteration Count to a large number, or 0 to go through all rows.
  3. Save the output CSV file to a path.

Expected behavior

The flow can iterate through a large number of rows without errors appearing.

Who can help?

No response

Operating System

Ubuntu 22.04.4 LTS

Langflow Version

1.0.5

Python Version

3.10

Screenshot

No response

Flow File

LLM Iteration error recreation.json

dosubot[bot] commented 2 months ago

To resolve the WORKER TIMEOUT and code 134 errors when using NamastexLabs LLM Data Iterator to iterate through a large CSV file in Langflow, you can adjust the max_concurrency and use_multithreading parameters in the DirectoryComponent configuration.

Here is an example of how you can configure these parameters:

from typing import List, Optional

def build(
    self,
    path: str,
    depth: int = 0,
    max_concurrency: int = 1,  # Reduce concurrency to avoid timeouts
    load_hidden: bool = False,
    recursive: bool = True,
    silent_errors: bool = False,
    use_multithreading: bool = False,  # Disable multithreading to avoid code 134 errors
) -> List[Optional[Record]]:
    resolved_path = self.resolve_path(path)
    file_paths = retrieve_file_paths(resolved_path, load_hidden, recursive, depth)
    loaded_records = []
    if use_multithreading:
        loaded_records = parallel_load_records(file_paths, silent_errors, max_concurrency)
    else:
        loaded_records = [parse_text_file_to_record(file_path, silent_errors) for file_path in file_paths]
    loaded_records = list(filter(None, loaded_records))
    self.status = loaded_records
    return loaded_records

By setting max_concurrency to a lower value and disabling use_multithreading, you can mitigate the issues related to worker timeouts and code 134 errors [1].

To continue talking to Dosu, mention @dosu.

DDT-scrip commented 2 months ago

Please explain the code 134 errors @dosu.

dosubot[bot] commented 2 months ago

Code 134 errors typically indicate that a process was terminated due to an abort signal, often caused by a critical issue such as a segmentation fault or an out-of-memory condition. In the context of Langflow, these errors can occur when the system runs out of resources or encounters a severe error during execution.
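
As a quick illustration of where the number comes from (a minimal sketch, not Langflow-specific): a process that receives SIGABRT (signal 6) is reported by the shell and by Gunicorn's logger as exit code 128 + 6 = 134, which is exactly the "was sent code 134" seen in the logs.

import signal
import subprocess

# Minimal sketch: a process that aborts (SIGABRT, signal 6) exits with
# the shell-style code 128 + 6 = 134, the value Gunicorn logs.
proc = subprocess.run(
    ["python3", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)
print(proc.returncode)        # -6: subprocess reports the raw signal number
print(128 + signal.SIGABRT)   # 134: the exit code reported in the worker logs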

To mitigate code 134 errors, you can try the following:

  1. Reduce Concurrency: Lower the max_concurrency setting to reduce the number of concurrent tasks.
  2. Disable Multithreading: Set use_multithreading to False to avoid potential issues with multithreading.
  3. Increase Timeout: Ensure the timeout settings are sufficiently high to allow long-running tasks to complete.

Here's an example configuration:

def build(
    self,
    path: str,
    depth: int = 0,
    max_concurrency: int = 1,  # Reduce concurrency to avoid timeouts
    load_hidden: bool = False,
    recursive: bool = True,
    silent_errors: bool = False,
    use_multithreading: bool = False,  # Disable multithreading to avoid code 134 errors
) -> List[Optional[Record]]:
    resolved_path = self.resolve_path(path)
    file_paths = retrieve_file_paths(resolved_path, load_hidden, recursive, depth)
    loaded_records = []
    if use_multithreading:
        loaded_records = parallel_load_records(file_paths, silent_errors, max_concurrency)
    else:
        loaded_records = [parse_text_file_to_record(file_path, silent_errors) for file_path in file_paths]
    loaded_records = list(filter(None, loaded_records))
    self.status = loaded_records
    return loaded_records

Adjusting these settings can help prevent the system from running into critical issues that lead to code 134 errors.

To continue talking to Dosu, mention @dosu.

DDT-scrip commented 2 months ago

will code 134 cause WORKER TIMEOUT error? @dosu

dosubot[bot] commented 2 months ago

Yes, code 134 errors can cause a WORKER TIMEOUT error. Code 134 typically indicates that a process was terminated due to an abort signal, often caused by critical issues such as a segmentation fault or an out-of-memory condition. When such a severe error occurs, it can lead to the worker process being terminated abruptly, resulting in a timeout as the system waits for a response that never comes. This is why reducing concurrency, disabling multithreading, and ensuring sufficient timeout settings are crucial to mitigate these issues.

To continue talking to Dosu, mention @dosu.

vasconceloscezar commented 2 months ago

Hey, I made this flow a while ago and had these issues. That's why there is the start index to process in batches.

However, you can try setting another .env variable like this:

LANGFLOW_FRONTEND_TIMEOUT=6000

Try this and check if it works

DDT-scrip commented 2 months ago

@vasconceloscezar I tried the variable setting and didn't see any significant change, even when I increased it to 60000. The flow might be too heavy for my weak hardware. I do wonder where the bottleneck is, though; it wasn't system memory, so more likely graphics memory? But the loop starts up just fine, and the amount of data in each row is roughly the same.

Maybe the unloading of the LLM isn't clean, so each invoke eats up more and more memory?
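
To test that hypothesis on my end, I'm planning to watch GPU memory while the flow runs, with a rough sketch like this (assuming nvidia-smi is available on the host):

import subprocess
import time

# Rough diagnostic sketch: print per-GPU memory usage every few seconds,
# to see whether usage creeps up with each LLM invocation.
while True:
    used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    ).split()
    print(time.strftime("%H:%M:%S"), "MiB used per GPU:", used)
    time.sleep(5)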

If you have any insight on this I would really like to know. PS: would setting up LangSmith/Langwatch or other recommended tools help debug these kinds of errors?

DDT-scrip commented 2 months ago

@dosu explain what timeout and frontend timeout are, and how to change these settings

dosubot[bot] commented 2 months ago

In Langflow, there are two main timeout settings you can configure: worker_timeout and frontend_timeout.

  1. Worker Timeout (worker_timeout):

    • This setting defines the maximum time (in seconds) that a worker process is allowed to run before it is terminated. The default value is 300 seconds.
    • You can configure this in the Settings class located in src/backend/base/langflow/services/settings/base.py:
    class Settings(BaseSettings):
       # Other settings...
       worker_timeout: int = 300
       """Timeout for the API calls in seconds."""
       # Other settings...
    • Additionally, you can set this value using the Langflow CLI with the --timeout option, which defaults to 60 seconds:
    - `--timeout`: Sets the worker timeout in seconds. The default is `60`.
  2. Frontend Timeout (frontend_timeout):

    • This setting defines the maximum time (in seconds) that the frontend API calls are allowed to take before timing out. The default value is 0 seconds, which means no timeout.
    • You can configure this in the Settings class as well:
    class Settings(BaseSettings):
       # Other settings...
       frontend_timeout: int = 0
       """Timeout for the frontend API calls in seconds."""
       # Other settings...

To change these settings, you can either modify the .env file or set the environment variables directly in your system. For example, to set the frontend_timeout to 6000 seconds, you can add the following line to your .env file:

LANGFLOW_FRONTEND_TIMEOUT=6000

Or, you can set it directly in your terminal before running Langflow:

export LANGFLOW_FRONTEND_TIMEOUT=6000
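
For example, here is a minimal sketch (the values are illustrative, not recommendations) that sets both timeouts when launching Langflow from a Python script:

import os
import subprocess

# Illustrative values only: raise the frontend timeout via the environment
# variable and the worker timeout via the `langflow run --timeout` CLI flag.
env = dict(os.environ, LANGFLOW_FRONTEND_TIMEOUT="6000")
subprocess.run(["langflow", "run", "--timeout", "300"], env=env)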

Regarding your concern about memory management and the unloading of LLMs, Langflow handles memory management for LLM invocations using the fix_memory_inputs function and has a cleanup routine implemented via the flush_langfuse_callback_if_present function. This ensures that memory inputs are managed and any langfuse callbacks are properly flushed, aiding in memory management and cleanup [1].

If you suspect that the issue might be related to memory leaks or performance bottlenecks, using built-in telemetry and logging tools in Langflow can help diagnose these issues. Langflow collects various performance metrics and provides a logs page to inspect component executions and message history, which can be useful for debugging [2][3][4].

If you need further assistance, feel free to ask!

To continue talking to Dosu, mention @dosu.

vasconceloscezar commented 2 months ago

Is this still happening in the latest langflow version?

I just iterated through a CSV with 30 rows, and everything worked as expected.

Can you please test it at the main branch?

DDT-scrip commented 1 month ago

@vasconceloscezar Currently I'm on langflow 1.0.17 and ollama 0.3.9.

I tried increasing the --timeout setting in the CLI to 180 and increasing LANGFLOW_FRONTEND_TIMEOUT=6000 in the env file, but it made the situation worse (15-20 rows down to less than 10).

I tried updating Ollama to the newest version and switching between different models (llama3, 3.1, mistral-nemo, phi3); no matter whether the model is big or small, it always stops iterating at around 15-20 rows.

I suspect it may be caused by Ollama, as some other users have also experienced a similar issue.

There is no crash; it just stops generating and hangs, without an error code:

9月 04 08:55:22 llmserver ollama[221701]: [GIN] 2024/09/04 - 08:55:22 | 200 | 21.357729469s | 127.0.0.1 | POST "/api/chat"
9月 04 08:55:42 llmserver ollama[221701]: [GIN] 2024/09/04 - 08:55:42 | 200 | 20.279408554s | 127.0.0.1 | POST "/api/chat"
9月 04 08:55:51 llmserver ollama[221701]: [GIN] 2024/09/04 - 08:55:51 | 200 | 8.569732642s | 127.0.0.1 | POST "/api/chat"

Then Langflow throws the code 134 error:

2024-09-05 11:37:33 - DEBUG - - Invoking LLM
2024-09-05 11:37:39 - CRITICAL - glogging - WORKER TIMEOUT (pid:244579)
2024-09-05 11:37:39 - ERROR - glogging - Worker (pid:244579) was sent code 134!
2024-09-05 11:37:39 - INFO - glogging - Booting worker with pid: 263450

But changing the timeout and frontend timeout also affects performance, so currently I'm not sure what the problem is.

vasconceloscezar commented 1 month ago

This could indeed be an issue with Ollama. I'll try to set up Ollama here and give it a try.

Could you test with a cheaper model like gpt 4 mini? Just to see if it works?

DDT-scrip commented 1 month ago

@vasconceloscezar Due to limited hardware, all my models (llama3, 3.1, mistral-nemo, phi3) are the smaller ones: phi3:3.8b (2.2 GB), llama3:8b (4.7 GB), mistral-nemo 23b (7.1 GB).

All are quite cheap models. My setup is 2 x 1060 6GB, so the model won't exceed the memory of a single card, and definitely won't exceed both cards. The model response time for my CSV iteration prompt averages around 20 seconds on a successful iteration; when it suddenly stops, it usually stops after less than 10 seconds of processing.

carlosrcoelho commented 4 weeks ago

@DDT-scrip

Do you need any assistance with this case? If not, please let us know if this issue can be closed.

DDT-scrip commented 3 weeks ago

@carlosrcoelho I have currently moved the loop out of the Langflow system, calling the API from an external script to work around the issue (roughly along the lines of the sketch below). The loop still stops, but only after hundreds of iterations and hours of running (still the WORKER TIMEOUT (pid:249640), Worker (pid:) was sent code 134! issue, though), so the issue still exists; moving the loop out to an external Python script is just much better,
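
For reference, the external script looks roughly like this — a minimal sketch, assuming the flow is exposed through Langflow's /api/v1/run/<flow-id> endpoint (the URL, flow ID, and payload fields would need to match your own instance):

import csv
import requests

# Hypothetical values: replace with your own server URL and flow ID.
RUN_URL = "http://127.0.0.1:7860/api/v1/run/<your-flow-id>"

with open("input.csv", newline="", encoding="utf-8") as src, \
     open("output.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # One request per row keeps each LLM invocation short, so no single
        # call sits inside a Langflow worker long enough to hit the timeout.
        resp = requests.post(
            RUN_URL,
            json={"input_value": row[0], "input_type": "chat", "output_type": "chat"},
            timeout=120,
        )
        resp.raise_for_status()
        # resp.text is the raw JSON response; parse out whichever message
        # field your flow actually returns before saving.
        writer.writerow([row[0], resp.text])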

and it makes me suspect even more that it's caused by Ollama.

I think this issue can be closed. Thanks for your help.