
Allow Internal-Frontend to Register/Update Namespace (and other methods) #4919

Open · andreclaro opened this issue 1 year ago

andreclaro commented 1 year ago

Expected Behavior

Register/Update Namespace (and other methods) should be allowed when connecting to the internal-frontend.

Actual Behavior

Currently, several actions can only be performed through the Frontend service. If we connect to the internal-frontend and attempt one of those actions, such as Register/Update Namespace, we get the following error: unable to find bootstrap container for the given service name

As @MichaelSnowden mentioned:

> Hi, André. That’s an interesting issue. To me, it looks like this is caused by the fact that the RegisterNamespace workflow service API’s implementation is using a hard-coded service name of primitives.FrontendService instead of fetching the current service name (which may be primitives.InternalFrontendService) [here](https://github.com/temporalio/temporal/blob/f1ef4db6783f11fa29c8f6e11f307959b3b23079/service/frontend/namespace_handler.go#L852). We register the bootstrap container dynamically based on the service name, but it looks like, when reading it, we always assume that the caller is the frontend service. As you can imagine, there’s a lot of code shared between the two services, but there are some small differences, so it’s easy for bugs like this to crop up. This is all just my deduction from reading the code, but I’m not too familiar with it, since I didn’t write it. In the meantime, might I ask why you want to use the internal frontend service?

Btw, running all services together locally, I do not see this issue.

> That is likely because the frontend service will write the bootstrap container to the shared registry that the internal frontend service will erroneously look into with the frontend service’s key when you run them together. If you just run the internal frontend service on its own, the bootstrap container won’t be there because the regular frontend service didn’t write it.
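
To make the diagnosis concrete, here is a minimal Go sketch of the pattern described above. The names (`registry`, `getBootstrapContainer`) are hypothetical stand-ins, not the actual Temporal internals; the real code lives around the `namespace_handler.go` line linked above.

```go
package main

import "fmt"

// Hypothetical stand-ins for Temporal's per-service bootstrap registry.
// The server registers the bootstrap container dynamically under the name
// of the service that is actually running.
const (
	frontendService         = "frontend"
	internalFrontendService = "internal-frontend"
)

type bootstrapContainer struct{} // would hold archival providers, etc.

var registry = map[string]*bootstrapContainer{}

// Buggy read path: the key is hard-coded to the frontend service, so a
// process running only the internal-frontend never finds its container.
func getBootstrapContainerBuggy() (*bootstrapContainer, error) {
	c, ok := registry[frontendService]
	if !ok {
		return nil, fmt.Errorf("unable to find bootstrap container for the given service name")
	}
	return c, nil
}

// Fixed read path: key the lookup by the service handling the call.
func getBootstrapContainer(currentService string) (*bootstrapContainer, error) {
	c, ok := registry[currentService]
	if !ok {
		return nil, fmt.Errorf("unable to find bootstrap container for the given service name")
	}
	return c, nil
}

func main() {
	// Only the internal-frontend registers itself in this process.
	registry[internalFrontendService] = &bootstrapContainer{}

	_, err := getBootstrapContainerBuggy() // fails: looked up the wrong key
	fmt.Println("buggy lookup:", err)

	c, err := getBootstrapContainer(internalFrontendService) // succeeds
	fmt.Println("fixed lookup:", c != nil, err)
}
```

This also shows why running everything in one process masks the bug: the regular frontend registers its container under its own key, so the hard-coded lookup happens to succeed.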

Full error logs:

```json
{
    "level": "error",
    "ts": "2023-09-28T15:25:12.080Z",
    "msg": "service failures",
    "operation": "RegisterNamespace",
    "wf-namespace": "test",
    "error": "unable to find bootstrap container for the given service name",
    "logging-call-at": "telemetry.go:328",
    "stacktrace": "go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).handleError\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:328\ngo.temporal.io/server/common/rpc/interceptor.(*TelemetryInterceptor).UnaryIntercept\n\t/home/builder/temporal/common/rpc/interceptor/telemetry.go:169\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/service/frontend.(*RedirectionInterceptor).handleLocalAPIInvocation\n\t/home/builder/temporal/service/frontend/redirection_interceptor.go:214\ngo.temporal.io/server/service/frontend.(*RedirectionInterceptor).Intercept\n\t/home/builder/temporal/service/frontend/redirection_interceptor.go:188\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/metrics.NewServerMetricsContextInjectorInterceptor.func1\n\t/home/builder/temporal/common/metrics/grpc.go:66\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n\t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.0/interceptor.go:344\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceLogInterceptor).Intercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_logger.go:84\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc/interceptor.(*NamespaceValidatorInterceptor).NamespaceValidateIntercept\n\t/home/builder/temporal/common/rpc/interceptor/namespace_validator.go:111\ngoogle.golang.org/grpc.getChainUnaryHandler.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1179\ngo.temporal.io/server/common/rpc.ServiceErrorInterceptor\n\t/home/builder/temporal/common/rpc/grpc.go:145\ngoogle.golang.org/grpc.chainUnaryInterceptors.func1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1170\ngo.temporal.io/api/workflowservice/v1._WorkflowService_RegisterNamespace_Handler\n\t/go/pkg/mod/go.temporal.io/api@v1.24.0/workflowservice/v1/service.pb.go:1537\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1360\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:1737\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/pkg/mod/google.golang.org/grpc@v1.57.0/server.go:982"
}
```

Steps to Reproduce the Problem

  1. Perform RegisterNamespace or UpdateNamespace against the internal-frontend (see the sketch below).
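
For example, a minimal Go client that reproduces the failure; it assumes the gogo-era `go.temporal.io/api@v1.24` (matching the logs above) and a deployment-specific internal-frontend address, both of which may differ in your setup:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/api/workflowservice/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Address is deployment-specific (assumed here); in a real cluster this
	// endpoint should only be reachable with mTLS / network restrictions.
	conn, err := grpc.Dial("temporal-internal-frontend:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	retention := 72 * time.Hour
	_, err = workflowservice.NewWorkflowServiceClient(conn).RegisterNamespace(
		context.Background(),
		&workflowservice.RegisterNamespaceRequest{
			Namespace:                        "test",
			WorkflowExecutionRetentionPeriod: &retention,
		})
	// With archival enabled, this fails with:
	// "unable to find bootstrap container for the given service name"
	log.Println(err)
}
```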

dnr commented 1 year ago
  1. Internal-frontend is for use by the server worker role and the history/matching roles only. All calls from clients should go through the normal frontend for proper authorization. In particular, there are no internal calls to RegisterNamespace, so that just shouldn't be happening.

    I see the migration workflow (running on server worker) uses UpdateNamespace, which would trigger this. Are you using the migration workflow?

  2. It looks like this is only a problem if archival is enabled on the namespace.

MichaelSnowden commented 1 year ago

> All calls from clients should go through the normal frontend for proper authorization.

In this case, I think André read the patch notes here about the internal frontend role and thought that it meant you could use it to bypass auth for admin requests as well. However, it seems like this is something we didn't intend to support, because this was originally just for simplifying inter-node auth.

I think I see two ways we can go about this:

  1. We can support this use case (not sure if we want to, though, as it seems insecure). This would mean making some code changes here.
  2. We can say that we don't support this use case, and we can help you find another solution, André.

To me, this originally looked like a bug because we were propagating the "wrong" service name around, but if, as a policy, no one should be calling these APIs on the internal frontend, then it's not a bug. I think we should have a clearer error message, though. Maybe we can do something like check the request metadata to see if calls to the internal-frontend come from another service and, if not, return an error message? Not as a method of protection, because it could obviously be spoofed, but as a way to detect misuse and inform the user.
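
As a sketch of that detection idea, a gRPC unary interceptor could look roughly like this; the header name is hypothetical and this is illustrative only, not Temporal's actual interceptor chain:

```go
package interceptor

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// Hypothetical header that server-internal callers would set. The check is
// not protection (the header can be spoofed); it only turns silent misuse
// into a readable error.
const internalCallerHeader = "temporal-internal-caller"

// MisuseDetectionInterceptor rejects calls to the internal-frontend that
// do not look like they come from another server role.
func MisuseDetectionInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	md, _ := metadata.FromIncomingContext(ctx)
	if len(md.Get(internalCallerHeader)) == 0 {
		return nil, status.Error(codes.PermissionDenied,
			"this API was called on the internal-frontend, which serves server-internal traffic only; use the regular frontend")
	}
	return handler(ctx, req)
}
```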

In addition, it seems like no other nodes are calling RegisterNamespace on the internal-frontend (which makes sense because this is not something I see us automating). However, if we do in the future, it might make sense to fix this service name issue now (separately from the internal-frontend misuse error message change).

So, @dnr, I think we should do this:

  1. Add a clearer error message when these APIs are misused on the internal-frontend.
  2. Fix the hard-coded service name issue separately, in case internal callers need these APIs in the future.

Also, @andreclaro , would you mind elaborating on your use case? Why don't you want to use the frontend service here?

dnr commented 1 year ago

The point about the migration workflow using UpdateNamespace means we do need to fix this.

But to be clear, external clients must not be making calls to the internal-frontend; that must be prevented (with network restrictions, mTLS, or both), or else there's no point to it.

andreclaro commented 1 year ago

First of all, thank you for the quick response!

We are using TLS (mTLS to be enabled later), network policies and we are now enabling authorization. We also have archival enabled for both history and visibility.

I totally understand that we shouldn't allow external services to access the internal-frontend.

My initial idea was to use the internal-frontend for administration tasks whenever required (only accessible by administrators who can exec into the internal-frontend service); however, we are planning to build a service to perform those tasks by getting a JWT token from our authorization service.

The other new use case is to use the temporal-operator to manage namespaces, but it seems that this operator does not currently support JWT tokens.

dnr commented 1 year ago

I see. I don't think we'd go out of our way to break calling internal-frontend with tctl/temporal cli, but it's not what it's intended for, in the same way that you generally wouldn't do RPC calls directly to history or matching services. I'd recommend setting up proper authorization and doing administrative tasks through the regular frontend.

andreclaro commented 1 year ago

Yes, that makes sense. What about point 2 (it looks like this is only a problem if archival is enabled on the namespace)? Are you going to fix this? Thanks!