slackapi / java-slack-sdk

Slack Developer Kit (including Bolt for Java) for any JVM language
https://slack.dev/java-slack-sdk/
MIT License
571 stars 212 forks source link

Does Oauth Flow work on a cluster of replica nodes? #1209

Closed tzuanw closed 11 months ago

tzuanw commented 12 months ago

Hi Team, we have installed our Slack App on a springboot server and are encountering the following error during the install process:

[or-http-epoll-4] .b.s.b.o.d.OAuthDefaultStateErrorHandler : Invalid state parameter detected However, this only happens when we deploy multiple replica nodes of our slack server on Kubernetes. When the cluster is scaled down to 1 node, the invalid state param error does not happen anymore.

There is nothing bespoke about our set up, our App Config has been set with:

.oauthInstallPath("/slack/install")
.oauthRedirectUriPath("/slack/oauth_redirect")

We have also set up the redirect url in the App Management tab. Obviously, this screenshot is just an example ss of what we have configured, the actual url we are using in production is different:

image

Other important considerations we have taken, we moved away from using FileOauthStateService, and even tried using AmazonS3OAuthStateService. We even tried using our on implementation of OauthStateService using a PostGresql DB. It appears that even using a decoupled storage layer doesn't make the oauth process stateless across the nodes of the cluster.

Without going further into details, I was wondering if you have any insights about running oauth flow through a k8 cluster of node instances of slack server vs 1 node? Or is the expectation that the sdk only supports running oauth on a single server at a given time?

The Slack SDK version

gradle dependencies

dependencies { implementation("org.springframework.boot:spring-boot-starter-webflux") implementation("com.slack.api:slack-api-client:1.29.2") implementation("com.slack.api:slack-api-model-kotlin-extension:1.29.2") implementation("com.slack.api:bolt:1.29.2")

Java Runtime version

openjdk version "17.0.8.1" 2023-08-24 OpenJDK Runtime Environment Temurin-17.0.8.1+1 (build 17.0.8.1+1) OpenJDK 64-Bit Server VM Temurin-17.0.8.1+1 (build 17.0.8.1+1, mixed mode, sharing)

OS version

Ubuntu 22.04.3 LTS

seratch commented 12 months ago

Hi @tzuanw, thank you for asking the question.

As long as you use a properly implemented OAuthStateService, there is no blocker for serving the OAuth endpoints across multiple app sever nodes.

One possible cause of your situation could be related to web browser cookie handling. As you may already be aware, the state parameter validation verifies if the state value in the query string and the one in cookies match. OAuthStateService sets the cookie in /slack/install this way, and then your redirect URI endpoint tries to retrieve the string to verify. If this process does not work for some reason, your situation can arise.

Please double-check if you're using the same domain for both endpoints. Also, if you have a specific reason not to host the endpoints in the same domain, you can customize the code by overriding the methods.

I hope this helps.

tzuanw commented 11 months ago

Hi @seratch thank you for your insights! Tbh, when there is only node, I have been able to run the install process successively for up to 20 consecutive attempts. When there are 2 nodes, it fails immediately.

To clarify, we are not mutating or interfering with the query string and the cookies throughout the install process. Furthermore, we are using Kong configuration to manage all ingress traffic which routes to only one domain, ie both:

.oauthInstallPath("/slack/install") .oauthRedirectUriPath("/slack/oauth_redirect")

sit on the same domain.

One thing we are doing that is bespoke is we implement OAuthV2DefaultSuccessHandler in order to determine when the install is complete and then trigger another unrelated process. Another reason for implementing is I have also observed that the only way to capture theteam name because for some reason, team name is missing from public interface Installer. Would you concur that implementing OAuthV2DefaultSuccessHandler should have no effect on the above state exception?

Finally a somewhat related question is for issuing new state, how would you even be able to intercept or control the lifecycle of a new state?

Thanks for your help on this!

seratch commented 11 months ago

Would you concur that implementing OAuthV2DefaultSuccessHandler should have no effect on the above state exception?

This customization does not influence the state validation in any way. You can explore the logic of how the services can be utilized here: https://github.com/slackapi/java-slack-sdk/blob/v1.32.1/bolt/src/main/java/com/slack/api/bolt/service/builtin/DefaultOAuthCallbackService.java#L98-L132

Finally a somewhat related question is for issuing new state, how would you even be able to intercept or control the lifecycle of a new state?

If you need more customization of the OAuthStateService's behavior, you can override issueNewState, isValid, etc. However, if your app isn't currently compatible with the default behavior, something in your app or infrastructure may prevent secure OAuth state validation. I advise against altering the service class code just to bypass verification.

Given the information at hand, I do not have any further recommendations. I hope you identify the cause of your issue soon!

tzuanw commented 11 months ago

Sorry to bother you, but are there any documented integration tests of the slack sdk oauth process against a cluster of nodes vs a single node? Even better is there something in the repo that can be run so I can observe how it is running vs how our process is running?

Also, I just considered another scenario: is it possible the one leg of the oauth request is being routed to a different node and hence causing a difference in the state value of the cookie vs the querystring?

seratch commented 11 months ago

are there any documented integration tests of the slack sdk oauth process against a cluster of nodes vs a single node? Even better is there something in the repo that can be run so I can observe how it is running vs how our process is running?

The OAuth flow is a manual operation conducted by a human using a web browser, so automated tests for it could involve techniques such as Selenium testing. At present, we don't have any resources available about this.

Also, I just considered another scenario: is it possible the one leg of the oauth request is being routed to a different node and hence causing a difference in the state value of the cookie vs the querystring?

If your OAuth endpoint hosting does not maintain browser cookies in such situations, unexpected results may occur.

That being said, I am unsure about the cause of your particular issue. I suspect that cookie handling may play a role, but the problem could also be related to accessing persistent state values on the server-side. While I acknowledge that you've tried both S3 and your own RDB, I'm not completely confident that you've used them correctly given the conditions you stated -- a single node versus multiple nodes. Enabling debug-logging and/or incorporating additional logs might help you pinpoint the problem.

tzuanw commented 11 months ago

That makes sense but also please let me frame the scenario more carefully.

Is it possible that the given state is different from the session state (cookie), because the session state is not consistent across nodes?

And if so, is there a way for me to persist the session state in similar fashion to the various implementations of OauthStateService? If Im not mistaken, the only check there is for whether the state exists in the database, and that is independent of the check between the given state and the one in the session.

seratch commented 11 months ago

Is it possible that the given state is different from the session state (cookie), because the session state is not consistent across nodes?

No, I don't think this can happen. As you can see, the session is just a simple cookie string sent by a web browser. The user-agent should not know which host they're accessing, right? Just in case, I recommend that you confirm the actual request/response headers in a web browser using your browser's inspector (devtools) to ensure they are the ones you expect.

As for the persistent layer on the server side, please refer to the built-in ones' code to learn what is expected. Having said that, perhaps your implementation should be fine. To make sure which logic is rejecting, you can consider overridding isValid method to have more debug logs.

tzuanw commented 11 months ago

I will enable debug logs and get to the bottom of this. If you would rather close this ticket for now, please go ahead. Thanks again @seratch for your keen insights.