temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License

[Bug] Rust SDK workflow ends up stuck in Running state #686

Closed djc closed 6 months ago

djc commented 6 months ago

What are you really trying to do?

I'm trying to build my first workflow with the Rust SDK. This workflow calls a few activities and then returns anyhow::Result<()>.

Describe the bug

Apart from a few snags, I was able to get the Rust SDK working and produced a worker that appears to be able to serve workflow execution requests. However, once the workflow execution completes (that is, all the activities it calls into succeed), the workflow ends up sitting there in Running state and never moves into Completed state.

Minimal Reproduction

Don't have this yet, and might be a little hard to do? I did have previous test workflows that ended up in Completed.

Environment/Versions

Sushisource commented 6 months ago

@djc This will definitely require at least some kind of repro. There are bunches of tests using the SDK to run workflows to completion, so it's probably something pretty specific. If the workflow function exits, the workflow will complete, so it may be that you've got something that just isn't resolving.

djc commented 6 months ago

Well, I wrote an example that has the same structure and pushed it to a repo. In my actual workflow, the three activities and the cleanup() function do some actual work, mainly HTTP requests and Postgres/Redis interactions via async Rust code, but that's not easy to reproduce outside of our specific environment.

Unfortunately this doesn't reproduce the issue I was seeing despite resulting in a pretty similar event history. In case it helps, I'm attaching the event history from my actual workflow (which implements certificate provisioning via ACME):

6f497446-182d-4a6b-b10f-3d0e8d676249_events.json

And here's the event history from the example code:

15664f8b-e65d-46e8-8fb6-2112a3f7d3d3_events.json

Right now this is blocking us from deploying Temporal into production, so any help is much appreciated.

Sushisource commented 6 months ago

@djc Based on what you have here, the cleanup() function in the real code must be doing something that prevents the workflow from completing. In both histories the last activity is completing, but in the real one the workflow is hitting some await point and pausing whereas in the fake example it finishes entirely.

If you are doing any kind of await that isn't ultimately using one of the Temporal APIs under the hood, that's not going to work, as the SDK implementation will respond to the workflow task as soon as the top-level future returns Pending. So, if that Pending gets caused by something non-Temporal, you're liable to just hang.
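To make the mechanism concrete, here is a minimal std-only sketch. It is not the SDK's actual code: `poll_once`, `ExternalIo`, and both workflow functions are hypothetical names invented for illustration. It models a driver that polls the top-level workflow future once per workflow task and treats `Pending` as "nothing more to do until Temporal delivers something new".

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing. In this toy model the driver never relies on
/// wake-ups; it only polls when it has a new workflow task to process.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical driver step: poll the workflow future exactly once.
/// A `Pending` result is where the driver would respond to the workflow task.
fn poll_once<F: Future>(fut: Pin<&mut F>) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    fut.poll(&mut cx)
}

/// Stand-in for a non-Temporal await (an HTTP call, a database query, ...).
/// It is not ready on the first poll, and the driver has no idea what it is
/// waiting for.
struct ExternalIo {
    polled: bool,
}

impl Future for ExternalIo {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            Poll::Ready(())
        } else {
            self.polled = true;
            Poll::Pending
        }
    }
}

/// A "workflow" that awaits something the driver does not manage.
async fn workflow_with_external_await() -> &'static str {
    ExternalIo { polled: false }.await;
    "completed"
}

/// A "workflow" that never yields on anything external.
async fn workflow_without_external_await() -> &'static str {
    "completed"
}
```

Polling `workflow_with_external_await()` once through `poll_once` yields `Poll::Pending`, while `workflow_without_external_await()` yields `Poll::Ready("completed")` on the first poll. In this model, a `Pending` caused by `ExternalIo` leaves the driver with no event that would ever prompt another poll, which matches the "stuck in Running" symptom described in this issue.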

djc commented 6 months ago

> If you are doing any kind of await that isn't ultimately using one of the Temporal APIs under the hood, that's not going to work, as the SDK implementation will respond to the workflow task as soon as the top-level future returns Pending. So, if that Pending gets caused by something non-Temporal, you're liable to just hang.

Huh, I definitely didn't expect that. So how am I supposed to use async Rust in Temporal activities?

Sushisource commented 6 months ago

@djc You can do whatever you want in an activity - it's inside workflows where you should only be using Temporal APIs (or basic stuff like futures combinators, select, etc.).

djc commented 6 months ago

Ahh, yes. Alright, that fixes it, thanks for the explanation!