igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License

Cannot load from Amazon S3 bucket (TooManyRequestsException) #1311

Closed · victoriayi closed 1 year ago

victoriayi commented 1 year ago

igv0.log

Hello, "Amazon > Load from S3 bucket" does not respond. Please see attached log beginning from Apr 04, 2023.

jrobinso commented 1 year ago

How are you passing credentials? Have you set up Cognito as described in the README?

victoriayi commented 1 year ago

Yes we have. There are periods when it works and periods when it doesn't.

AWS Support gave us this response:

Here the "Rate Exceeded" error occurred because a single identity-id requested more than 5 GetOpenIdToken calls in a second. To be more specific, the Identity ID sent 8 requests which resulted in RateExceed error.

As per my internal team, to resolve this issue, you may attempt to bring the per-identity GetOpenIdToken API call rate under 5 RPS by following the suggestions below:

a) Set up a retry architecture or exponential back-off mechanism to minimize calls on a per-identity basis [2].
b) Split the calls across multiple identities instead of using a single identity for all requests. Otherwise, the GetOpenIdToken requests will keep coming in at a constant rate, which will keep returning throttling errors without giving the throttle time to cool down.
c) We also recommend looking into the possibility of caching the token itself, which can reduce the number of calls to the identity pool service. The OpenID token is valid for 10 minutes [3].

[2] Error retries and exponential backoff in AWS: https://docs.aws.amazon.com/general/latest/gr/api-retries.html
[3] GetOpenIdToken (the OpenID token is valid for 10 minutes): https://docs.aws.amazon.com/cognitoidentity/latest/APIReference/API_GetOpenIdToken.html

It's really one person attempting to load at a given second, so we're unsure what would cause 8 requests to be sent.
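For illustration, AWS's suggestion (a) might look roughly like the sketch below, using the AWS SDK v2 for Java that IGV is built on. This is a minimal sketch, not IGV code: getTokenWithBackoff, MAX_ATTEMPTS, and the backoff constants are hypothetical names.

import java.util.concurrent.ThreadLocalRandom;
import software.amazon.awssdk.services.cognitoidentity.CognitoIdentityClient;
import software.amazon.awssdk.services.cognitoidentity.model.GetOpenIdTokenRequest;
import software.amazon.awssdk.services.cognitoidentity.model.GetOpenIdTokenResponse;
import software.amazon.awssdk.services.cognitoidentity.model.TooManyRequestsException;

public class OpenIdTokenRetry {

    private static final int MAX_ATTEMPTS = 5;

    // Retry GetOpenIdToken with exponential backoff and full jitter (suggestion a),
    // so a burst of calls cannot keep hammering the per-identity 5 RPS quota.
    public static GetOpenIdTokenResponse getTokenWithBackoff(CognitoIdentityClient client,
                                                             GetOpenIdTokenRequest request)
            throws InterruptedException {
        long ceilingMs = 200; // initial backoff ceiling
        for (int attempt = 1; ; attempt++) {
            try {
                return client.getOpenIdToken(request);
            } catch (TooManyRequestsException e) {
                if (attempt >= MAX_ATTEMPTS) throw e; // give up after a few tries
                // Sleep a random amount up to the current ceiling, then double the ceiling.
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceilingMs) + 1);
                ceilingMs *= 2;
            }
        }
    }
}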

jrobinso commented 1 year ago

Hmm, me neither. Can you tell me exactly how they are trying to load? I haven't been able to reproduce it, but I am just doing simple loads from the menu picker. I need some steps to reproduce; the more explicit, the better.

brainstorm commented 1 year ago

I haven't been able to reproduce it, but I am just doing simple loads from the menu picker. I need some steps to reproduce; the more explicit, the better.

+1, we've not seen this in our org (UMCCR) yet. Happy to dig further with more details, not only about the file being loaded but also the overall configuration of your setup. For instance, I'd like to have a peek at the oauth.config you are using, since I suspect there might be some sort of (infinite?) redirect loop somewhere.

victoriayi commented 1 year ago

Sorry for the delay. We wanted to confirm whether the error occurs after logging in or after attempting to load the S3 bucket, but for a while we saw no issues again, so we couldn't replicate the error. Until today...

igv0.log

Beginning from Apr 20, 2023 in the log:

12:16 Open IGV
12:20 Amazon > Login > login is successful
12:21 Amazon > Load from S3 bucket > we get the unhandled exception

oauth-config_prod.txt Attached is our oauth.config. (We changed the file extension to .txt in order to upload it here, and we censored our IDs with asterisks.)

Thank you!

brainstorm commented 1 year ago

@victoriayi, can you please share the token refresh timeout you have defined in Cognito?

You can use the refresh token to retrieve new ID and access tokens. By default, the refresh token expires 30 days after your application user signs into your user pool. When you create an application for your user pool, you can set the application's refresh token expiration to any value between 60 minutes and 10 years.

If it's too short, it might be expiring mid-session in IGV and generating the problems you are experiencing, since I'm not sure there's a mechanism implemented for this contingency. As a rule of thumb, the refresh token expiry should be longer than your average user's session time.

Other things you might want to check: comparing your file with our oauth config below, I've observed some differences that might also be triggering the issues you are experiencing:

{
    "apiKey": "",
    "client_id": "*",
    "project_id": "*",
    "scope": "email%20openid%20profile",
    "authorization_endpoint": "https://*.auth.ap-southeast-2.amazoncognito.com/login",
    "token_endpoint": "https://*.auth.ap-southeast-2.amazoncognito.com/oauth2/token",
    "auth_provider": "Amazon",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "",
    "aws_cognito_fed_pool_id": "ap-southeast-2:*",
    "aws_cognito_pool_id": "ap-southeast-2_*",
    "aws_cognito_role_arn": "arn:aws:iam::*:role/Cognito_igvAuth_Role",
    "aws_region": "ap-southeast-2",
    "redirect_uris": [
      "http://localhost:60151/oauthCallback"
    ]
}

Please pay special attention to the token_endpoint attribute, since your URL seems to be lacking the oauth2 path component. Also, we don't use a client_secret. I'd recommend going through our IGV frontend guide and double-checking the backend blog post too.

I'd also strongly recommend upgrading from 2.13.0 (your version) to the latest one as of today, 2.16.0: https://igv.org/download/html/download.html

I'll do the same and check if your issue pops up. EDIT: I just tested with the latest IGV and cannot reproduce the issue. Please check the oauth config details mentioned above; it could be a URL redirect loop issue.

jrobinso commented 1 year ago

Thanks @brainstorm. Minor update: we just released 2.16.1, which fixes a nasty bug affecting some VCF files, so I would suggest updating to that.

victoriayi commented 1 year ago

Hello,

Authentication flow session duration: 3 minutes
Refresh token expiration: 60 minutes
Access token expiration: 5 minutes
ID token expiration: 5 minutes

Thanks for sharing your oauth config; we'll ask our sysadmin to try making the changes you suggested. (I believe we referred to https://umccr.org/blog/igv-amazon-backend-setup/ to set this up, which also lacks the oauth2 component in its token_endpoint and uses a client secret... I hope this is indeed the issue though!) I've also updated my IGV, thank you.

brainstorm commented 1 year ago

I'd strongly suggest bumping up all those timeouts significantly, I suspect that's the main problem actually.

brainstorm commented 1 year ago

Here are our production timeouts, if that helps. @victoriayi, did you have any other hiccups after our last message exchange?

[Screenshot: our production Cognito token expiration settings]

For third party user reference, those settings can be found here on the AWS console.

brainstorm commented 1 year ago

And here's a good read about session expiration times: https://www.sjoerdlangkemper.nl/2023/08/16/session-timeout/

Good HN comments too: https://news.ycombinator.com/item?id=37173339

jrobinso commented 1 year ago

For what it's worth, I have now encountered this error testing our standard Cognito setup. It's happened twice out of maybe 20 test startups. In both cases, logging out and back in via the Amazon menu fixed it. The exact error is

Caused by: software.amazon.awssdk.services.cognitoidentity.model.TooManyRequestsException: Rate exceeded (Service: CognitoIdentity, Status Code: 400, Request ID: b7e338a5-ee41-4684-8a1a-b3ca30c0eeaf, Extended Request ID: null)

It looks like my access token expiration is 1 hour and my refresh token expiration is 30 days.

So, possibly an IGV error here; I will keep looking.

My Cognito configuration follows:


  {
    "apiKey": "",
    "project_id": "igv",
    "auth_provider": "Amazon",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "aws_region": "us-east-1",
    "scope": "email%20openid%20profile",
    "client_id": "*",
    "client_secret": "*",
    "authorization_endpoint": "https://igv-demo.auth.us-east-1.amazoncognito.com/login",
    "token_endpoint": "https://igv-demo.auth.us-east-1.amazoncognito.com/token",
    "aws_cognito_fed_pool_id": "us-east-1:2926325e-0938-4dfc-840a-fa8a3e4f63f7",
    "aws_cognito_pool_id": "us-east-1_vfq4mnd4t",
    "aws_cognito_role_arn": "arn:aws:iam::*:role/Cognito_igvAuth_Role"
  }

brainstorm commented 1 year ago

I've been thinking that perhaps external actors are hitting the authorization_endpoint more than 200 times in an attempt to brute-force JWTs, enumerate users, or mount similar external attacks. Can you check your CloudWatch metrics according to these instructions (under the "Track quota usage" section)?

https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html
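For reference, the quota usage described there can also be pulled programmatically rather than through the console. A sketch with the AWS SDK v2 for Java; the AWS/Usage dimension values below are assumptions and should be verified against the dimensions your account actually emits:

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.*;

public class CognitoUsageReport {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            // Usage metrics live in the AWS/Usage namespace; the dimension values
            // here are assumptions -- check CloudWatch -> Metrics -> Usage for yours.
            Metric metric = Metric.builder()
                    .namespace("AWS/Usage")
                    .metricName("CallCount")
                    .dimensions(
                            Dimension.builder().name("Type").value("API").build(),
                            Dimension.builder().name("Service").value("Cognito Identity").build(),
                            Dimension.builder().name("Resource").value("GetOpenIdToken").build(),
                            Dimension.builder().name("Class").value("None").build())
                    .build();
            MetricDataQuery query = MetricDataQuery.builder()
                    .id("cognitoCalls")
                    .metricStat(MetricStat.builder()
                            .metric(metric)
                            .period(3600) // one data point per hour
                            .stat("Sum")
                            .build())
                    .build();
            GetMetricDataResponse resp = cw.getMetricData(GetMetricDataRequest.builder()
                    .startTime(Instant.now().minus(Duration.ofDays(30)))
                    .endTime(Instant.now())
                    .metricDataQueries(query)
                    .build());
            resp.metricDataResults().forEach(r ->
                    System.out.println(r.timestamps() + " -> " + r.values()));
        }
    }
}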

jrobinso commented 1 year ago

Sorry I'm not finding much on CloudWatch. I don't have any alarms set. Could you be more specific on what to look for?


brainstorm commented 1 year ago

Sure, I should have been more specific... under CloudWatch -> Metrics, you can plot a bunch of Cognito-related metrics over time. For instance, here's how it looks for TokenRefreshSuccess in three different user pools (colors) in our system:

[Screenshot: TokenRefreshSuccess plotted over time for three user pools]

It seems those metrics have only started to tick in the last couple of months for us; I'm investigating right now. But I meant precisely that: look at the Cognito-related metrics present in CloudWatch, no need to set alarms yet.

jrobinso commented 1 year ago

CallCount and ThrottleCallCount are not available under CloudWatch > Metrics for me.

jrobinso commented 1 year ago

Only TokenRefreshSuccesses and SigninSuccesses

brainstorm commented 1 year ago

Yes, they're not available for me either; I'm trying to see how to enable them from now on if possible.

[Screenshot: CloudWatch Metrics console, with CallCount and ThrottleCallCount missing]

Sigh... Amazon (see Note in https://docs.aws.amazon.com/cognito/latest/developerguide/metrics-for-cognito-user-pools.html)... I'll go through the AWS CLI and report back.

jrobinso commented 1 year ago

It's late here, I'm signing off. One more observation, and this might be a coincidence: both times I saw it was with a fresh login through the browser, that is, the username and password fields were not set. On subsequent (successful) logins I was not asked by the browser for a username and password, only to select the username as shown in the form below. Again, this might be coincidence; just noting it.

[Screenshot: browser account-selection form shown on subsequent logins]

brainstorm commented 1 year ago

Now, here's Cognito CallCount:

[Screenshot: Cognito CallCount over the past month, with a single spike of ~112 calls]

As you can see, we had a single spike of ~112 calls last month (less than the official quota of 200), so perhaps that's why nobody in our org has reported a similar issue (so far). I'd love to get to the bottom of this and fix it if possible, though!

jrobinso commented 1 year ago

@brainstorm How did you get that report? Can you show me the query?

brainstorm commented 1 year ago

That report is just a series of "click ops" through the CloudWatch console (no actual queries involved); here's the sequence:

  1. CloudWatch -> Metrics -> All Metrics
  2. Search for "Usage" (type this word in the search box).
  3. Click on "Usage -> By AWS Resource".

[Screenshot: CloudWatch "Usage -> By AWS Resource" metrics view]

Then search for the "Cognito" attributes, define a time range, and most of the data points will show up.

Now, for actual queries, CloudWatch Logs Insights has way better detail than what I just showed.

For instance, I wanted to determine what happened in that spike of 112 calls from my previous plot, so I went to CloudWatch -> Logs -> Logs Insights and launched this query (after selecting a time range and the "CloudTrail/DefaultLogGroup"):

fields @timestamp, @message
| filter eventSource = "cognito-idp.amazonaws.com"
| sort @timestamp desc

And then I got a bunch of hits and rich data I could analyze more deeply:

[Screenshot: Logs Insights results for the Cognito query]

Here you have a few tailored Cognito queries that closely match our purposes so you don't have to construct them yourself:

https://docs.aws.amazon.com/cognito/latest/developerguide/analyzingcteventscwinsight.html
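As an aside, the same Logs Insights query can be launched outside the console. A sketch with the AWS SDK v2 for Java, assuming the same CloudTrail log group as above (the polling loop is simplified and has no timeout):

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.*;

public class CognitoLogQuery {
    public static void main(String[] args) throws InterruptedException {
        try (CloudWatchLogsClient logs = CloudWatchLogsClient.create()) {
            // Start the same query as above against the CloudTrail log group.
            String queryId = logs.startQuery(StartQueryRequest.builder()
                    .logGroupName("CloudTrail/DefaultLogGroup")
                    .startTime(Instant.now().minus(Duration.ofDays(7)).getEpochSecond())
                    .endTime(Instant.now().getEpochSecond())
                    .queryString("fields @timestamp, @message "
                            + "| filter eventSource = \"cognito-idp.amazonaws.com\" "
                            + "| sort @timestamp desc")
                    .build()).queryId();

            // Poll until the query leaves the Scheduled/Running states.
            GetQueryResultsResponse results;
            do {
                Thread.sleep(1000);
                results = logs.getQueryResults(GetQueryResultsRequest.builder()
                        .queryId(queryId).build());
            } while (results.status() == QueryStatus.SCHEDULED
                    || results.status() == QueryStatus.RUNNING);

            results.results().forEach(row -> row.forEach(field ->
                    System.out.println(field.field() + " = " + field.value())));
        }
    }
}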

Then I dug into the data's source IP addresses and user agents, filtering locally with jq, e.g.:

jq '.[]."@message" | {"eventTime", "sourceIPAddress", "eventName", "userAgent", "userIdentity"}' logs-insights-narrowing-08-08.json | grep sourceIPAddress | sort | uniq

It turned out that, indeed, external agents were running automated reconnaissance tools (on AWS instances, actually) and attempting to break in (unsuccessfully). I could determine that by reverse-resolving (DNS PTR) a few of the top IP addresses (host <ip address> on any UNIX system).

Therefore, the rate limiting errors you are experiencing could be those external "attacks".

Let me know what you find in your logs so we can compare notes?

There are also some insightful observations from @reisingerf and @andrewpatto, such as the IGV client not gracefully handling that 400 code with some exponential back-off when the quota of login attempts is exceeded... although that could keep the user waiting unnecessarily long for a successful login? Perhaps worth considering while you refactor the OAuth support.

I'll briefly explore solutions for this particular problem although, luckily, we've not been impacted by this yet :/

Note: CloudWatch Logs Insights web console data export (download) is capped at 10,000 records, but there are ways to circumvent that should you want to analyze the data offline... fortunately, in most cases, a few query tweaks/filters will list what you want within AWS.

jrobinso commented 1 year ago

OK thanks, I'll look into this.

This suggestion looks promising:

There are also some insightful observations from @reisingerf and @andrewpatto, such as the IGV client not gracefully handling that 400 code with some exponential back-off when the quota of login attempts is exceeded... although that could keep the user waiting unnecessarily long for a successful login? Perhaps worth considering while you refactor the OAuth support.

I could look into catching the error and building in some retries, but it's really hard to test this since it can't be reliably reproduced. A 400 error is rather generic as well. Nonetheless, it should be pretty easy to build in some retries. In my case, when this occurred, I logged out and back in and all was good, but I did not try just waiting a few seconds.

The real difficulty here is testing a solution, as the behavior itself is hard to reproduce.

jrobinso commented 1 year ago

@brainstorm Also, BTW, the message from AWS at the top of this thread indicates the rate limit is on a per-identity basis:

Here the "Rate Exceeded" error occurred because a single identity-id requested more than 5 GetOpenIdToken calls in a second. To be more specific, the Identity ID sent 8 requests which resulted in RateExceed error.

jrobinso commented 1 year ago

Bingo. It's our bug. GetCognitoAWSCredentials is called many times, which in turn calls

GetOpenIdTokenResponse openId = cognitoIdentityClient.getOpenIdToken(openidrequest.build());

This needs to be rate limited. The only thing used from this response is the token, which should be valid for 10 minutes. This should not be hard to fix.

EDIT: OK, I see what's going on here. GetCognitoAWSCredentials is called for each bucket when building the file picker tree. This in turn triggers the getOpenIdToken call, which is rate limited. The more buckets you have, the more likely you are to see this error. I don't see any reason to get new credentials before the existing credentials have expired. This should be an easy fix.
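The fix direction described in the EDIT amounts to caching the token until it expires instead of fetching it per bucket. A minimal sketch of that idea; the CachedOpenIdToken wrapper and its constants are hypothetical, not the actual PR:

import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.cognitoidentity.CognitoIdentityClient;
import software.amazon.awssdk.services.cognitoidentity.model.GetOpenIdTokenRequest;

public class CachedOpenIdToken {

    // The OpenID token is valid for 10 minutes; refresh a little early to be safe.
    private static final Duration TOKEN_LIFETIME = Duration.ofMinutes(10);
    private static final Duration SAFETY_MARGIN = Duration.ofMinutes(1);

    private final CognitoIdentityClient client;
    private final GetOpenIdTokenRequest request;
    private String token;
    private Instant fetchedAt = Instant.MIN;

    public CachedOpenIdToken(CognitoIdentityClient client, GetOpenIdTokenRequest request) {
        this.client = client;
        this.request = request;
    }

    // Callers may invoke this once per bucket while building the file picker tree;
    // only the first call (or a call after expiry) actually hits Cognito.
    public synchronized String get() {
        Instant refreshAfter = fetchedAt.plus(TOKEN_LIFETIME).minus(SAFETY_MARGIN);
        if (token == null || Instant.now().isAfter(refreshAfter)) {
            token = client.getOpenIdToken(request).token();
            fetchedAt = Instant.now();
        }
        return token;
    }
}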

jrobinso commented 1 year ago

I have a PR that should fix this. https://github.com/igvteam/igv/pull/1391/files?diff=split&w=1

brainstorm commented 1 year ago

Nice debugging and fix Jim, ship it! And sorry for the bugs :-S

Also, now I see why we have not been hitting this problem: we only list 1-2 buckets in the file picker.