temporalio / sdk-core

Core Temporal SDK that can be used as a base for language specific Temporal SDKs
MIT License
262 stars 70 forks source link

[Bug] ConnectionReset trying to connect to Temporal Server #720

Closed andrepastore closed 6 days ago

andrepastore commented 4 months ago

What are you really trying to do?

Using Rust, trying to connect the client to the server using tls_cert and the connect command as the exemple below

    let mut client_options_builder = ClientOptionsBuilder::default();
    client_options_builder
        .target_url(target_url)
        .client_name("name")
        .client_version("version");`

    if client_tls_config.is_some() {
        client_options_builder.tls_cfg(TlsConfig {
            server_root_ca_cert: None,
            domain: None,
            client_tls_config,
        });
    }

    let client_options = client_options_builder
        .build()
        .map_err(|err| TemporalError::ClientError(err.to_string()))?;

    let client = client_options
        .connect(namespace, meter, None)
        .await
        .map_err(|err| TemporalError::ClientError(err.to_string()))?;

Describe the bug

before #698 the code above connects correctly to the server and we can begin to creating workers, after the merge the same code returns the following error

Error: ClientError("'get_system_info call' error after connection: Status { code: Unknown, message: \"transport error\", source: Some(tonic::transport::Error(Transport, hyper::Error(Io, Kind(ConnectionReset)))) }")

Minimal Reproduction

In RUST, using any rev after the following:

temporal-client = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }
temporal-sdk = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }
temporal-sdk-core = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }
temporal-sdk-core-api = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }
temporal-sdk-core-protos = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }
temporal-sdk-core-test-utils = { git = "https://github.com/temporalio/sdk-core", rev = "c55ba57" }

and trying to connect a client to the server as of the exemple above

Environment/Versions

Additional context

the cargo.toml that gives me the error cargo.toml.txt

Sushisource commented 4 months ago

@andrepastore Hmm, weird. This definitely isn't some fundamental bug as this code./tonic version is running now in our other SDKs and in our tests etc just fine.

Are you using a self hosted server? Cloud? Is there some proxy in the way?

andrepastore commented 3 months ago

@Sushisource we are using temporal cloud, with TLS certificates for auth, there are no proxies in the environment

happylinks commented 1 week ago

@andrepastore did you figure this out? Running into the same issue.

Sushisource commented 1 week ago

@andrepastore Sorry this fell off my rader. @happylinks Are you using the Rust SDK as well? Tonic was upgraded relatively recently here https://github.com/temporalio/sdk-core/pull/782 and I also made a change that should read system certs by default here https://github.com/temporalio/sdk-core/pull/785

Those both might affect whatever is going on here.

happylinks commented 1 week ago

@Sushisource no problem! to be honest I'm using a fork, because I'm also writing workflows in rust 😅 (I know, not officially supported yet). It is up to date with main though, so not entirely sure yet what's causing it. I'm still doing more testing, will let you know what I find, so far it only happens when I'm using TLS certs, without it (locally) it works fine.

happylinks commented 1 week ago

Just based on your remark about system certs, this is in my dockerfile:

FROM debian:bookworm-slim

RUN apt-get update && apt-get install -y libssl-dev ca-certificates && rm -rf /var/lib/apt/lists/*

Not sure if there's any other ssl-related things I would need to install.

Sushisource commented 1 week ago

Yeah, I think that would probably do it in terms of having default certs. I think it's got to be something system related, since we do have tests that are running core based sdks against cloud (and, obviously users in real life doing that all the time too), hard for me to say exactly what though.

happylinks commented 1 week ago

OK! I can reproduce it on my macbook now, without the fork, so basically the same as @andrepastore

Cargo.toml:

temporal-client = { git = "https://github.com/temporalio/sdk-core", branch = "master" }
temporal-sdk = { git = "https://github.com/temporalio/sdk-core", branch = "master" }
temporal-sdk-core = { git = "https://github.com/temporalio/sdk-core", branch = "master" }
temporal-sdk-core-api = { git = "https://github.com/temporalio/sdk-core", branch = "master" }
temporal-sdk-core-protos = { git = "https://github.com/temporalio/sdk-core", branch = "master" }

Log

Error: `get_system_info` call error after connection: Status { code: Unknown, message: "transport error", source: Some(tonic::transport::Error(Transport, hyper::Error(Io, Kind(ConnectionReset)))) }

Code:

pub async fn create_client() -> Result<RetryClient<Client>, anyhow::Error> {
    let env = types::envs::TellaEnv::from_env();

    let address = get_env_var("TEMPORAL_ADDRESS", "localhost:7233", &env);
    let namespace = get_env_var("TEMPORAL_NAMESPACE", "default", &env);

    let temporal_key =
        get_env_var_opt("TEMPORAL_KEY", &env).map(|v| v.replace("\\n", "\n"));
    let temporal_pem =
        get_env_var_opt("TEMPORAL_PEM", &env).map(|v| v.replace("\\n", "\n"));

    let tls_config = match (temporal_key, temporal_pem) {
        (Some(key), Some(pem)) => Some(TlsConfig {
            domain: None,
            server_root_ca_cert: None,
            client_tls_config: Some(ClientTlsConfig {
                client_cert: pem.into_bytes(),
                client_private_key: key.into_bytes(),
            }),
        }),
        _ => None,
    };

    let url = Url::from_str(&format!("http://{address}"))?;
    let server_options = match tls_config {
        Some(tls_config) => sdk_client_options(url).tls_cfg(tls_config).build(),
        None => sdk_client_options(url).build(),
    }?;

    let client = server_options.connect(namespace, None).await?;
    Ok(client)
}
Sushisource commented 1 week ago

Ah, ok but that's against local server - I assume it was configured to use TLS? How is that server configured and running?

happylinks commented 1 week ago

Nah, sorry, I mean local worker configured against Temporal Cloud.

Sushisource commented 1 week ago

Oh, yeah I read the env var getter and assumed the default was what you were hitting. I'll try this out at some point today and see what I get.

happylinks commented 1 week ago

Ah, I see how that's confusing, yeah I set the temporal cloud config via env vars. Thanks for checking, let me know if I can help test anything!

andrepastore commented 1 week ago

hey, I didn't check this in a while, I was going to try the newest version to see if the issue was solved, but apparently it still remains, let me know if I can help with anything =)

Sushisource commented 1 week ago

So I added a simple tls test to Core's CI that connects to cloud, which looks essentially identical to your repro setup and it works fine: https://github.com/temporalio/sdk-core/pull/800

I couldn't get anything to repro locally either. That's sort of all expected of course since this works just fine in lots and lots of places.

Can you confirm that you can connect to cloud with the same setup properly with another SDK like Go or Java?

happylinks commented 1 week ago

I think it's the "https://" instead of "http://", testing on prod now.

happylinks commented 1 week ago

Yeah I think that fixed it 😅 Added this:

let scheme = if tls_config.is_some() {
    "https"
} else {
    "http"
};
let url = Url::from_str(&format!("{scheme}://{address}"))?;
andrepastore commented 6 days ago

@happylinks just tested it here and it did not work in my case, but our url is in grpc and not https as in your case, here is the example in this case:

grpc://boleto-seeker-dev.grnoi.tmprl.cloud:7233

happylinks commented 6 days ago

@andrepastore so just to confirm, you tried to switch it from grpc to https but got the same error?

andrepastore commented 6 days ago

@happylinks just tested it, if the url is using https it starts the worker, but then I receive this error: OpenTelemetry metrics error occurred. Metrics exporter otlp failed with the grpc server returns error (The service is currently unavailable): , detailed error message: tcp connect error: Connection refused (os error 61)

It seems it's an issue with the grpc, maybe some dependency that is not imported correctly in my case?

happylinks commented 6 days ago

@andrepastore mmm not sure about that one. Maybe you can share some more of the client connection code and the part where it passes the opentelemetry config?

At least happy we're moving forward, maybe this already helps @Sushisource understand the problem better :)

andrepastore commented 6 days ago

I tried to consolidate the full code in one block, hope it's clear:

let server_url = env::var("TEMPORAL_ADDRESS").unwrap_or("localhost:7233".to_owned());
let temporal_setup = TemporalSetup {
    server_url: format!("grpc://{server_url}"),
    namespace: env::var("TEMPORAL_NAMESPACE").unwrap_or("boleto-seeker-dev.grnoi".to_owned()),
    worker_setup: TemporalWorkerSetup {
        task_queue: env::var("TASK_QUEUE").unwrap_or("boleto-seeker-task-queue".to_owned()),
        worker_id: env::var("WORKER_ID").unwrap_or("boleto-seeker-worker".to_owned()),
    },
};

let dd_agent_host = env::var("DD_AGENT_HOST").unwrap_or("localhost".to_owned());
let dd_agent_port = env::var("DD_AGENT_PORT").unwrap_or("4317".to_owned());
let dd_agent_url = Url::from_str(&format!("grpc://{dd_agent_host}:{dd_agent_port}"))?;

let otel_collector_options = OtelCollectorOptionsBuilder::default()
    .url(dd_agent_url)
    .headers(HashMap::new())
    .build()?;
let otlp_exporter = build_otlp_metric_exporter(otel_collector_options)?;

let telemetry_options = TelemetryOptionsBuilder::default()
    .metrics(Arc::new(otlp_exporter) as Arc<dyn CoreMeter>)
    .build()?;

let runtime = CoreRuntime::new_assume_tokio(telemetry_options)?;

let target_url = Url::from_str(temporal_setup.server_url).map_err(|err| {
    TemporalError::ClientError(format!("Invalid temporal server url :: error: {:?}", err))
})?;

let tls_cert = env::var("TEMPORAL_TLS_CERT")
    .ok()
    .and_then(|cert| general_purpose::STANDARD.decode(cert).ok());
let tls_key = env::var("TEMPORAL_TLS_KEY")
    .ok()
    .and_then(|key| general_purpose::STANDARD.decode(key).ok());
let client_tls_config = match (tls_cert, tls_key) {
    (Some(cert), Some(key)) => Some(ClientTlsConfig {
        client_cert: cert,
        client_private_key: key,
    }),
    _ => None,
};

let mut client_options_builder = ClientOptionsBuilder::default();
client_options_builder
    .target_url(target_url)
    .client_name("boleto-seeker-worker")
    .client_version("0.1.0");

if client_tls_config.is_some() {
    client_options_builder.tls_cfg(TlsConfig {
        server_root_ca_cert: None,
        domain: None,
        client_tls_config,
    });
}

let client_options = client_options_builder
    .build()
    .map_err(|err| TemporalError::ClientError(err.to_string()))?;

let client = client_options
    .connect(temporal_setup.namespace, runtime.telemetry().get_temporal_metric_meter())
    .await
    .map_err(|err| TemporalError::ClientError(err.to_string()))?;

Ok(client)
Sushisource commented 6 days ago

Well, the OpenTelemetry metrics error occurred. Metrics exporter otlp failed with the grpc server returns error (The service is currently unavailable): , detailed error message: tcp connect error: Connection refused (os error 61) error is unrelated to the rest of the worker connection. That's seemingly a fairly clear refusal by what looks like datadog to accept the connection. I can't really say why they are refusing your connection. Maybe you need to be providing some kind of authentication token in the headers which you are currently leaving empty (though I'd expect an unauthorized error in that case). Connection refusal could be the wrong port, or maybe it also doesn't like the grpc scheme. The mTLS options for the worker connection don't have anything to do with the OTel connection.

I'm going to close this issue since it looks like problems with the worker's TLS connection were a result of misconfiguration. I would recommend using the https:// scheme unless you have some compelling reason not to.