microsoft / promptflow

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
MIT License

[BUG] Race condition with global state in process pool #3553

Open bhanson-techempower opened 2 months ago

bhanson-techempower commented 2 months ago

Describe the bug

Prompt flow appears to modify the global working directory, and workers in the line execution process pool can get spawned with different working directories.

How To Reproduce the bug

I created an example project to reproduce the bug here: https://github.com/bhanson-techempower/promptflow-concurrency-bug

Some runs will succeed and others will randomly fail with:

Flow path .../promptflow-concurrency-bug/sub_flow2/sub_flow3 does not exist.

This happens because the worker for that particular node is spawned after the working directory has been changed.

With the example project the bug is reproduced for me almost every time.

In our production application we're seeing it about half the time when running a batch of 20 runs.

Expected behavior

The flow executes successfully every time because prompt flow does not share extra global state between processes.

Running Information:

{
  "promptflow": "1.12.0",
  "promptflow-core": "1.12.0",
  "promptflow-devkit": "1.12.0",
  "promptflow-tracing": "1.12.0"
}

Executable '.../promptflow-concurrency-bug/venv/bin/python'
Python (Darwin) 3.11.3 (main, May 25 2023, 12:42:30) [Clang 11.1.0 ]

Additional context

We've worked around the issue by modifying the way we invoke the sub flows:

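# Assumes load_flow is imported from promptflow.client.
from pathlib import Path
from promptflow.client import load_flow

# Anchor the sub flow path to this file so it does not depend on the cwd.
flow_path = Path(__file__).parent / "sub_flow1"
flow = load_flow(flow_path)

bhanson-techempower commented 2 months ago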

Also not sure if I should open up another issue, but the example project also shows a bug with the trace viewer. When I open the link provided in the console only the errored cases show up. This is happening every time for me on that project, although I haven't noticed it in our actual applications.

There should be 10 rows here:

[Image: Screenshot 2024-07-16 at 14 36 20]

We have noticed other subtle bugs with the trace viewer. Sometimes the information displayed in the table format doesn't match the underlying case when you click on it. The same line run might be duplicated a few times and then we need to click into each row to find the actual case we're looking for.

My suspicion is both of these are related to other concurrency issues and not actually a problem with the trace viewer.

brynn-code commented 2 months ago

Hi @bhanson-techempower, thanks for reaching out and for the detailed reproduction project. We have investigated this problem, and I'll explain what we found.

Conclusion first

Any one of the following resolves the problem:

  1. Move the 'load_flow' call out of the @tool function.
  2. Use an absolute path.
  3. Set concurrency to 1.

Root cause

  1. When running a flow, promptflow changes the working directory to the flow directory so that the flow can run with the correct imports.
  2. Nodes without dependencies are executed concurrently.
  3. In your flow, the three nodes do not depend on each other, so they were executed concurrently in one process. Because the cwd changes made for one node affect the others, the relative flow path was resolved incorrectly when execution reached the load_flow line.
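The root cause above rests on the fact that the working directory belongs to the whole process, not to a single thread. A minimal sketch of that behavior, using only the Python standard library, might look like the following. The function name and the "sub_flow" path are hypothetical and chosen only for illustration; this is not promptflow code.

```python
import os
import tempfile
import threading


def resolve_before_and_after_chdir(relative_name="sub_flow"):
    """Resolve a relative path in a worker thread twice: once before and
    once after the main thread changes the working directory."""
    original_cwd = os.getcwd()
    resolved = []
    first_done = threading.Event()
    cwd_changed = threading.Event()

    def worker():
        # First resolution happens against the original cwd.
        resolved.append(os.path.abspath(relative_name))
        first_done.set()
        # Wait until the main thread has called os.chdir().
        cwd_changed.wait()
        # Same relative name, same thread, but a different result.
        resolved.append(os.path.abspath(relative_name))

    with tempfile.TemporaryDirectory() as other_dir:
        thread = threading.Thread(target=worker)
        thread.start()
        first_done.wait()
        try:
            os.chdir(other_dir)  # affects every thread in this process
        finally:
            cwd_changed.set()
            thread.join()
            os.chdir(original_cwd)  # restore before the temp dir is removed
    return resolved[0], resolved[1]


before, after = resolve_before_and_after_chdir()
print(before)
print(after)
```

The two printed paths differ even though the worker thread never called os.chdir() itself, which is the same effect that makes a relative flow path resolve to a directory that does not exist.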

Detail about the workaround

  1. Move the 'load_flow' call out of the @tool function -- the flow is then loaded when the file is imported. The cwd has not been changed at that point, so the flow loads successfully.
  2. Use an absolute path -- load_flow works with an absolute path because the cwd no longer affects path resolution.
  3. Set concurrency to 1 -- with a concurrency of 1, the cwd is changed and reverted correctly step by step, so load_flow runs with the correct cwd.
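To illustrate why the third workaround helps, the sketch below serializes every change-and-restore of the cwd behind a lock, so each task sees the directory it asked for. This is an assumption-laden illustration rather than promptflow's implementation: `working_directory`, `_cwd_lock`, and `run_in_directories` are hypothetical names.

```python
import os
import threading
from contextlib import contextmanager

# Hypothetical process-wide lock: only one thread may change the cwd at a time.
_cwd_lock = threading.RLock()


@contextmanager
def working_directory(path):
    """Change the cwd to `path`, then restore the previous cwd on exit."""
    with _cwd_lock:
        previous = os.getcwd()
        os.chdir(path)
        try:
            yield
        finally:
            os.chdir(previous)


def run_in_directories(directories):
    """Run one thread per directory and record the cwd each one observed."""
    observed = {}

    def task(directory):
        with working_directory(directory):
            # Inside the lock, no other thread can move the cwd.
            observed[directory] = os.getcwd()

    threads = [threading.Thread(target=task, args=(d,)) for d in directories]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return observed
```

The trade-off is the same as setting concurrency to 1: the work inside the lock runs one task at a time. Any code that resolves a relative path outside the lock can still race, which is why anchoring to an absolute path is the more robust workaround.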

To resolve this problem

To fully resolve this problem, we have to make each node run independently, for example in a separate process. I'm afraid this will be long-term work. I've added the 'long-term' tag, and we'll keep this item open for anyone who meets the same problem. We will update this item if we make changes to the related part.
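The reason a separate process would help is that each process has its own working directory. The sketch below shows only that operating-system behavior with the standard library `subprocess` module; it does not describe how promptflow would implement per-node isolation, and `cwd_seen_by_child` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def cwd_seen_by_child(directory):
    """Start a child Python process in `directory` and return the cwd it reports."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=directory,        # applies to the child process only
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `cwd_seen_by_child(some_dir)` leaves the parent's `os.getcwd()` unchanged, so nodes isolated this way could not disturb each other's relative path resolution.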