temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License
262 stars 70 forks source link

[Bug] Incompatible ephemeral server got cached in temp #779

Closed Cyberuben closed 1 month ago

Cyberuben commented 1 month ago

What are you really trying to do?

When creating an ephemeral server (via the various SDKs), the downloaded binary seems to be cached in /tmp, but is not re-downloaded when a new version is released. It seems like there might have been breaking changes in the server binary, and it took me a long time to figure out which binary was actually used to start this server.

Describe the bug

We use the Temporal TypeScript SDK in our project and use TestWorkflowEnvironment.createLocal to start a local server, to use with our unit tests.

const testEnv = await TestWorkflowEnvironment.createLocal({
    server: {
        log: { format: 'json', level: 'fatal' },
    },
});

Locally, everything ran fine, and it also seemed to run fine while pushing the changes to CI. These changes were eventually merged and deployed.

As of a few days ago, after the code was already deployed, tests started failing in irrelevant PRs, that ran on the CI environment.

Log output on CI ``` Error: invalid argument "fatal" for "--log-level" flag: fatal is not one of required values of debug, info, warn, error, never Usage: temporal server start-dev [flags] Flags: -f, --db-filename string File in which to persist Temporal state (by default, Workflows are lost when the process dies). --dynamic-config-value stringArray Dynamic config value, as KEY=JSON_VALUE (string values need quotes). --headless Disable the Web UI. -h, --help help for start-dev --http-port int Port for the frontend HTTP API service. Default is off. --ip string IP address to bind the frontend service to. (default "localhost") --log-config Log the server config being used to stderr. --metrics-port int Port for /metrics. Default is off. -n, --namespace stringArray Specify namespaces that should be pre-created (namespace "default" is always created). -p, --port int Port for the frontend gRPC service. (default 72[33](https://github.com/tryFragile/api/actions/runs/9911767960/job/27385131753#step:9:34)) --search-attribute stringArray Search attributes to register, in key=type format. --sqlite-pragma stringArray Specify SQLite pragma statements in pragma=value format. --ui-asset-path string UI custom assets path. --ui-codec-endpoint string UI remote codec HTTP endpoint. --ui-ip string IP address to bind the Web UI to. Default is same as --ip. --ui-port int Port for the Web UI. Default is --port + 1000. Global Flags: --color string Set coloring. Accepted values: always, never, auto. (default "auto") --env string Environment to read environment-specific flags from. (default "default") --env-file $HOME/.config/temporalio/temporal.yaml File to read all environments (defaults to $HOME/.config/temporalio/temporal.yaml). --log-format string Log format. Options are "text" and "json". Default is "text". --log-level server start-dev Log level. Default is "info" for most commands and "warn" for server start-dev. Accepted values: debug, info, warn, error, never. (default "info") --no-json-shorthand-payloads Always show all payloads as raw payloads even if they are JSON. -o, --output string Data output format. Note, this does not affect logging. Accepted values: text, json, jsonl, none. (default "text") --time-format string Time format. Accepted values: relative, iso, raw. (default "relative") ```

Note the Usage: and Flags: in the help text.

It seems like the fatal option was no longer valid, but we pin the version of our TypeScript dependencies, so this shouldn't have been caused in the first place.

Once we changed it to --log-level never, we suddenly got more log output than we did before, it printed the IP and port the server and metrics were available at.

Locally, I changed the config for our tests to also mute the output:

const testEnv = await TestWorkflowEnvironment.createLocal({
    server: {
        //                             vvvvv this was already live
        log: { format: 'json', level: 'never' },
        extraArgs: ['--output', 'none'],
    },
});

This gave me the following output locally:

Output ``` Incorrect Usage: flag provided but not defined: -output Did you mean "--ui-port"? NAME: temporal server start-dev - Start Temporal development server USAGE: Start Temporal Server on localhost:7233 with: temporal server start-dev View the UI at http://localhost:8233 To persist Workflows across runs, use: temporal server start-dev --db-filename temporal.db COMMANDS: help, h Shows a list of commands or help for one command OPTIONS: --db-filename value, -f value File in which to persist Temporal state (by default, Workflows are lost when the process dies) --namespace value, -n value [ --namespace value, -n value ] Specify namespaces that should be pre-created (namespace "default" is always created) --port value, -p value Port for the frontend gRPC service (default: 7233) --http-port value Port for the frontend HTTP service (default: disabled) --metrics-port value Port for /metrics (default: disabled) --ui-port value Port for the Web UI (default: --port + 1000, e.g. 8233) --headless Disable the Web UI (default: false) --ip value IPv4 address to bind the frontend service to (default: "127.0.0.1") --ui-ip value IPv4 address to bind the Web UI to (default: same as --ip) --ui-asset-path value UI custom assets path --ui-codec-endpoint value UI remote codec HTTP endpoint --log-format value Set the log formatting. Options: ["json", "pretty"]. (default: "json") --log-level value Set the log level. Options: ["debug" "info" "warn" "error" "fatal"]. (default: "info") --sqlite-pragma value [ --sqlite-pragma value ] Specify SQLite pragma statements in pragma=value format. Pragma options: ["journal_mode" "synchronous"]. --dynamic-config-value value [ --dynamic-config-value value ] Dynamic config value, as KEY=JSON_VALUE (string values need quotes) ```

Note how the output now shows USAGE: and OPTIONS: instead of Usage: and Flags:.

After a lot of digging, I found that starting an ephemeral server tries to download the executable and stores it in a temp folder.

It took a lot of time to figure out where this binary was written, but eventually I found it:

$ ls -lah /var/folders/<path>/<path>/T
total 281136
(...)
-rwxr-xr-x@   1 user  staff    74M  8 jul 14:28 temporal-sdk-typescript-1.10.1
-rwxr-xr-x@   1 user  staff    56M  8 jul 14:29 temporal-test-server-sdk-typescript-1.10.1

I deleted this file, so sadly I can't figure out which version it is anymore, but it redownloaded, and everything ran fine.

Given the fact that the TypeScript SDK version is in the file name, I expected the binary to stay pinned to some version at least, but this didn't seem to be the case. I'm not sure if this is something I should report in the TypeScript SDK repo or here, because the TypeScript SDK seems to be doing its job.

Minimal Reproduction

I'm not able to reproduce this, as I don't have the old binary left, I wasn't smart enough to keep it safe before I removed it.

Environment/Versions

Additional context

cretz commented 1 month ago

but is not re-downloaded when a new version is released

This is by intention, so that successive runs do not have to hit an HTTP endpoint at all (e.g. even to check if they are "latest"). By default, the binary is downloaded and stored with the version number of the SDK. An SDK upgrade gets a new binary, but we don't upgrade from first download while on same SDK. We do need to make it clearer where this is cached so users can remove it. We are also going to make the output more clearly show which server/CLI version is in use.

If you do need a certain version of the CLI, most SDKs allow you to specify it (though maybe not TypeScript, will check). Similarly if this download+cache mechanism isn't ideal for you, you can pass a path to your own downloaded version of the CLI and the download+cache mechanism is bypassed entirely.

Cyberuben commented 1 month ago

I think documenting how and where this binary is cached should be in the readme or in the docs somewhere, it was very challenging to find out this behavior. I understand that this is intentional though!

What I don't understand is how a non-backwards compatible version of the executable was released, even though the SDK version remained the same.

cretz commented 1 month ago

What I don't understand is how a non-backwards compatible version of the executable was released, even though the SDK version remained the same.

The CLI has not yet reached GA. We do need to be clearer on the dev server API from the different languages that it is using a non-GA dev server. On some SDKs (e.g. Python) we do make clear that "extra args" and similar are not guaranteed stable, e.g. at https://python.temporal.io/temporalio.testing.WorkflowEnvironment.html#start_local we write:

In the future, the dev server implementation may be changed to another implementation. Therefore, all devserver prefixed parameters are dev-server specific and may not apply to newer versions.

But we have missed that in TypeScript, we will update those API docs. We do not plan to make any more backwards incompatible changes, but technically it is possible until it reaches GA (hopefully soon). We will also make an issue for you to be able to pin the CLI version.

Cyberuben commented 1 month ago

Makes sense, thanks for the info!

cretz commented 1 month ago

I have opened https://github.com/temporalio/sdk-typescript/issues/1464

mjameswh commented 1 month ago

Closing this ticket in favor of temporalio/sdk-typescript#1464.