Closed · victoriayi closed this issue 1 year ago
How are you passing credentials? Have you setup Cognito as described in the Readme?
Yes we have. There are periods when it works and when it doesn't.
AWS Support gave us this response:
Here the "Rate Exceeded" error occurred because a single identity ID issued more than 5 GetOpenIdToken calls in a second. To be more specific, the identity ID sent 8 requests, which resulted in the Rate Exceeded error.
As per my internal team, to resolve this issue you may attempt to bring the per-identity GetOpenIdToken API call rate under 5 RPS by following the suggestions below:
a) Set up a retry architecture or exponential back-off mechanism to minimize calls on a per-identity basis [2].
b) Split the calls across multiple identities instead of using a single identity for all requests. Otherwise, the GetOpenIdToken requests will keep coming in at a constant rate, which will keep returning throttling errors without giving the throttle time to cool down.
c) We also recommend looking into the possibility of caching the token itself, which can reduce the number of calls to the Identity Pool service. The OpenID token is valid for 10 minutes [3].
[2] Error retries and exponential backoff in AWS: https://docs.aws.amazon.com/general/latest/gr/api-retries.html
[3] GetOpenIdToken (the OpenID token is valid for 10 minutes): https://docs.aws.amazon.com/cognitoidentity/latest/APIReference/API_GetOpenIdToken.html
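AWS's suggestion (a) could be sketched roughly as below. This is a minimal, hypothetical illustration of capped exponential back-off with jitter; the class and method names (`BackoffSketch`, `withRetry`) and the constants are illustrative assumptions, not part of the IGV codebase or the AWS SDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: retry a throttled call with capped exponential
// back-off plus jitter. Names and constants are illustrative only.
public class BackoffSketch {

    static final int MAX_ATTEMPTS = 5;
    static final long BASE_DELAY_MS = 200;
    static final long CAP_DELAY_MS = 10_000;

    // Deterministic exponential delay: base * 2^attempt, capped.
    static long backoffDelayMs(int attempt, long baseMs, long capMs) {
        return Math.min(capMs, baseMs * (1L << attempt));
    }

    // Run the call, sleeping with a jittered back-off between failed attempts.
    static <T> T withRetry(Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return call.call();
            } catch (Exception e) { // in real code: catch only the throttling exception
                last = e;
                long delay = backoffDelayMs(attempt, BASE_DELAY_MS, CAP_DELAY_MS);
                // "full jitter": sleep a random amount up to the computed delay
                Thread.sleep(ThreadLocalRandom.current().nextLong(delay + 1));
            }
        }
        throw last;
    }
}
```

In a real integration the `catch` clause would match only the SDK's throttling exception (e.g. the `TooManyRequestsException` seen later in this thread) so that genuine failures surface immediately.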
It's really one person attempting to load at a given second, so we're unsure what would cause 8 requests to be sent.
Hmm, me neither. Can you tell me exactly how they are trying to load? I haven't been able to reproduce it, but I am just doing simple loads from the menu picker. I need some steps to reproduce; the more explicit the better.
+1, we've not seen this in our org (UMCCR) yet; happy to dig further with more details, not only about the file being loaded but also the overall configuration of your setup. For instance, I'd like to have a peek at the `oauth.config` you are using, since I suspect there might be some sort of (infinite?) redirect loop somewhere?
Sorry for the delay. We wanted to confirm if the error occurs after logging in or after attempting to load the S3 bucket, but for a while we saw no issues again so we couldn't replicate the error. Until today...
Beginning from Apr 20, 2023 in the log:
- 12:16 Open IGV
- 12:20 Amazon > Login > login is successful
- 12:21 Amazon > Load from S3 bucket > we get the unhandled exception
oauth-config_prod.txt
Attached is our oauth.config. (We changed the file extension to .txt in order to upload it here, and censored our IDs with asterisks.)
Thank you!
@victoriayi, can you please share the token refresh timeout you have defined in Cognito?
You can use the refresh token to retrieve new ID and access tokens. By default, the refresh token expires 30 days after your application user signs into your user pool. When you create an application for your user pool, you can set the application's refresh token expiration to any value between 60 minutes and 10 years.
If it's too short, it might be expiring mid-IGV-session and generating the problems you are experiencing, since I'm not sure there's a mechanism implemented for this contingency. As a rule of thumb, the refresh token expiry should be longer than your average user's session time.
Other things you might want to check: comparing with our oauth config, I've observed some differences that might also be triggering the issues you are experiencing:
```json
{
  "apiKey": "",
  "client_id": "*",
  "project_id": "*",
  "scope": "email%20openid%20profile",
  "authorization_endpoint": "https://*.auth.ap-southeast-2.amazoncognito.com/login",
  "token_endpoint": "https://*.auth.ap-southeast-2.amazoncognito.com/oauth2/token",
  "auth_provider": "Amazon",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_secret": "",
  "aws_cognito_fed_pool_id": "ap-southeast-2:*",
  "aws_cognito_pool_id": "ap-southeast-2_*",
  "aws_cognito_role_arn": "arn:aws:iam::*:role/Cognito_igvAuth_Role",
  "aws_region": "ap-southeast-2",
  "redirect_uris": [
    "http://localhost:60151/oauthCallback"
  ]
}
```
Please pay special attention to the `token_endpoint` attribute, since your URL seems to be lacking the `oauth2` path component. Also, we don't use a `client_secret`. I'd recommend going through our IGV frontend guide and double-checking the backend blogpost too.
I'd also strongly recommend upgrading from 2.13.0 (your version) to the latest one as of today, 2.16.0: https://igv.org/download/html/download.html
I'll do the same and check if your issue pops up.
EDIT: I just tested it with the latest IGV and cannot reproduce the issue. Please check the oauth config details mentioned above; it could be a URL redirect loop issue.
Thanks @brainstorm , minor update, we just released 2.16.1 that fixes a nasty bug affecting some VCF files. So I would suggest updating to that.
Hello,
- Authentication flow session duration: 3 minutes
- Refresh token expiration: 60 minutes
- Access token expiration: 5 minutes
- ID token expiration: 5 minutes
Thanks for sharing your oauth config, we'll ask our sysadmin to try making the changes you suggested. (I believe we referred to https://umccr.org/blog/igv-amazon-backend-setup/ to set this up, which also lacks the `oauth2` component in its `token_endpoint` and uses a client secret... I hope this is indeed the issue though!)
I've also updated my IGV, thank you.
I'd strongly suggest bumping up all those timeouts significantly, I suspect that's the main problem actually.
Here are our production timeouts, if that helps. @victoriayi, did you have any other hiccups after our last message exchange?
For third party user reference, those settings can be found here on the AWS console.
And here's a good read about session expiration times: https://www.sjoerdlangkemper.nl/2023/08/16/session-timeout/
Good HN comments too: https://news.ycombinator.com/item?id=37173339
For what it's worth, I have now encountered this error testing our standard Cognito setup. It's happened twice out of maybe 20 test startups. In both cases, logging out and back in via the Amazon menu fixed it. The exact error is:
```
Caused by: software.amazon.awssdk.services.cognitoidentity.model.TooManyRequestsException: Rate exceeded (Service: CognitoIdentity, Status Code: 400, Request ID: b7e338a5-ee41-4684-8a1a-b3ca30c0eeaf, Extended Request ID: null)
```
It looks like my access token expiration is 1 hour and my refresh expiration is 30 days. So, possibly an IGV error here; will keep looking.
My Cognito configuration follows:
```json
{
  "apiKey": "",
  "project_id": "igv",
  "auth_provider": "Amazon",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "aws_region": "us-east-1",
  "scope": "email%20openid%20profile",
  "client_id": "*",
  "client_secret": "*",
  "authorization_endpoint": "https://igv-demo.auth.us-east-1.amazoncognito.com/login",
  "token_endpoint": "https://igv-demo.auth.us-east-1.amazoncognito.com/token",
  "aws_cognito_fed_pool_id": "us-east-1:2926325e-0938-4dfc-840a-fa8a3e4f63f7",
  "aws_cognito_pool_id": "us-east-1_vfq4mnd4t",
  "aws_cognito_role_arn": "arn:aws:iam::*:role/Cognito_igvAuth_Role"
}
```
I've been thinking that perhaps external actors are hitting the `authorization_endpoint` more than 200 times in an attempt to bruteforce JWTs, enumerate users, or run similar external attacks? Can you check your CloudWatch according to these instructions (under the "Track quota usage" section)?
https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html
Sorry I'm not finding much on CloudWatch. I don't have any alarms set. Could you be more specific on what to look for?
Sure, I should have been more specific... under CloudWatch -> Metrics, you can plot a bunch of Cognito-related metrics over time. For instance, this is how it looks for `TokenRefreshSuccess` in three different (colors) UserPools in our system:
It seems like those metrics have only started to tick in the last couple of months for us, I'm investigating right now, but I meant precisely that: look at Cognito-related metrics present on CloudWatch, no need to set alarms yet.
CallCount and ThrottleCallCount are not available under CloudWatch > Metrics for me.
Only TokenRefreshSuccesses and SigninSuccesses
Yes, neither are for me, I'm trying to see how to enable them from now on if possible.
Sigh... Amazon (see Note in https://docs.aws.amazon.com/cognito/latest/developerguide/metrics-for-cognito-user-pools.html)... I'll go through the AWS CLI and report back.
It's late here, I'm signing off. One more observation, and this might be a coincidence, but both times I saw it was with a fresh login through the browser, that is, the username and password fields were not set. On subsequent (successful) logins I was not asked by the browser for a username and password, only to select the username as shown in the form below. Again, this might be a coincidence; just noting it.
Now, here's Cognito CallCount:
As you can see, we had a single event of ~112 calls last month (less than the official 200 quota), so perhaps that's why nobody in our org has reported a similar issue so far. I'd love to get to the bottom of this and fix it if possible, though!
@brainstorm How did you get that report? Can you show me the query?
That report is just a series of "click ops" through CloudWatch console (no actual queries involved), here's the sequence:
Then search for the "Cognito" attributes, define a time range and most of the data points will show up.
Now, for actual queries, Cloudwatch Logs Insights has way better details than what I just showed.
For instance, I wanted to determine what happened in that 112-call spike from my previous plot, so I just went to CloudWatch -> Logs -> Logs Insights and launched this query (after selecting a time range and the "CloudTrail/DefaultLogGroup"):
```
fields @timestamp, @message
| sort @timestamp desc
| filter eventSource = "cognito-idp.amazonaws.com"
```
And then I got a bunch of hits and rich data I could analyze more deeply:
Here are a few tailored Cognito queries that closely match our purposes, so you don't have to construct them yourself:
https://docs.aws.amazon.com/cognito/latest/developerguide/analyzingcteventscwinsight.html
Then, after digging into the data's source IP addresses and user agents, filtering locally with `jq`, i.e.:
```
jq '.[]."@message" | {"eventTime", "sourceIPAddress", "eventName", "userAgent", "userIdentity"}' logs-insights-narrowing-08-08.json | grep sourceIPAddress | sort | uniq
```
It turned out that, indeed, external agents were running automated reconnaissance tools (on AWS instances, actually) and attempting to break in (unsuccessfully). I could determine that by reverse-resolving (DNS PTR) a few of the top IP addresses (`host <ip address>` on any UNIX system).
Therefore, the rate limiting errors you are experiencing could be those external "attacks".
Let me know what you find in your logs and compare notes?
There are also some insightful observations from @reisingerf and @andrewpatto, such as the IGV client not gracefully handling that 400 code when the login-attempt quota is exceeded; it could use some exponential back-off... although that could keep the user waiting unnecessarily long for a successful login? Perhaps worth considering while you refactor the OAuth support.
I'll briefly explore solutions for this particular problem although, luckily, we've not been impacted yet by this :/
Note: CloudWatch Logs insights web console data export (download) is capped at 10000 results/records, but there are ways to circumvent that should you want to analyze the data offline... fortunately, in most cases, with a few query tweaks/filters you can just list what you want within AWS.
OK thanks, I'll look into this.
This suggestion looks promising:
> There's also some insightful observations from @reisingerf and @andrewpatto such as the IGV client not gracefully handling that 400 code when the quota of login attempts is exceeded by using some exponential back-off... although that could keep the user waiting for successful login unnecessarily long? Perhaps worth considering while you refactor the OAuth support.
I could look into catching the error and building in some retries, but it's really difficult to test this since it can't be reliably reproduced. A 400 error is rather generic as well. Nonetheless, it should be pretty easy to build in some retries. In my case, when this occurred I logged out and back in and all was good, but I did not try just waiting a few seconds.
The real difficulty here is testing a solution, as the behavior itself is hard to reproduce.
@brainstorm Also, BTW, the message from AWS at the top of this thread indicates the rate limit is on a per-identity basis:
Here the "Rate Exceeded" error occurred because a single identity-id requested more than 5 GetOpenIdToken calls in a second. To be more specific, the Identity ID sent 8 requests which resulted in RateExceed error.
Bingo. It's our bug. `GetCognitoAWSCredentials` is called many times, which in turn calls
```java
GetOpenIdTokenResponse openId = cognitoIdentityClient.getOpenIdToken(openidrequest.build());
```
This needs to be rate limited. The only thing used from this response is the token, which should be valid for 10 minutes. This should not be hard to fix.
EDIT: ok, I see what's going on here. `GetCognitoAWSCredentials` is called for each bucket when building the file picker tree. This in turn triggers the call to `getOpenIdToken`, which is rate limited. The more buckets you have, the more likely you are to see this error. I don't see any reason to get new credentials before the existing credentials have expired. This should be an easy fix.
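The fix direction described above (reuse the token until it is near expiry, instead of fetching fresh credentials per bucket) might be sketched as follows. The `CachedToken` class, the `Supplier` wrapper, and the 30-second safety margin are illustrative assumptions, not the actual code from the PR linked below.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch: cache the OpenID token and only invoke the
// rate-limited fetcher again when the cached value is near expiry, so
// listing N buckets results in one getOpenIdToken call instead of N.
public class CachedToken {
    private final Supplier<String> fetcher;  // wraps the real getOpenIdToken call
    private final Duration lifetime;         // OpenID tokens are valid ~10 minutes
    private final Duration safetyMargin = Duration.ofSeconds(30);
    private String token;
    private Instant expiresAt = Instant.EPOCH; // forces a fetch on first use

    public CachedToken(Supplier<String> fetcher, Duration lifetime) {
        this.fetcher = fetcher;
        this.lifetime = lifetime;
    }

    public synchronized String get() {
        // Refresh slightly early so callers never hold an expired token.
        if (Instant.now().isAfter(expiresAt.minus(safetyMargin))) {
            token = fetcher.get();
            expiresAt = Instant.now().plus(lifetime);
        }
        return token;
    }
}
```

With this shape, building the file picker tree would call `get()` once per bucket but hit the Cognito endpoint only once per token lifetime.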
I have a PR that should fix this. https://github.com/igvteam/igv/pull/1391/files?diff=split&w=1
Nice debugging and fix Jim, ship it! And sorry for the bugs :-S
Also now I see why we have not been hitting this problem: we just list 1-2 buckets on the file picker.
igv0.log
Hello, "Amazon > Load from S3 bucket" does not respond. Please see the attached log, beginning from Apr 04, 2023.