mozilla / persona-yahoo-bridge

A ProxyIdP service for bridging major IdPs who lack support for the BrowserID protocol.
27 stars 15 forks source link

First time failure #179

Closed callahad closed 11 years ago

callahad commented 11 years ago

I can't repo this yet, but I watched a first-time bigtent user on completely fail to authenticate.

What happened:

  1. Incognito window, Chrome 27, Linux.
  2. 123done, clicked sign-in
  3. Typed a new-to-Persona Yahoo address
  4. Primary transition screen
  5. Yahoo auth page
  6. Yahoo popover captcha
  7. Yahoo OpenID permission screen
  8. Error authenticating with assertion

The user was moving somewhat slowly, as other people were hanging around and chatting. Perhaps sometime timed out?

Upon closing the popup and re-starting the flow, the user was able to successfully log in by just typing their email address.

shane-tomlinson commented 11 years ago

Is it the same as https://github.com/mozilla/browserid/issues/3145?

callahad commented 11 years ago

EDIT: This is not a dupe. See below.

callahad commented 11 years ago

Lloyd and I have been able to reproduce this.

Screen Shot 2013-04-04 at 2 50 15 PM

callahad commented 11 years ago

This is on line 2920 of bidbundle.js, from line 79 of jwcrypto's assertion.js:

exports.verify = function(signedObject, publicKey, now, cb) {
  jwcrypto.verify(signedObject, publicKey, function(err, payload) {
    if (err) return cb(err);

    var assertionParams = extractAssertionParamsFrom(payload);

    // check iat
    if (assertionParams.issuedAt) {
      if (assertionParams.issuedAt.valueOf() > now.valueOf())
        return cb("assertion issued later than verification date");
    }

    // check exp expiration
    if (assertionParams.expiresAt) {
      if (assertionParams.expiresAt.valueOf() < now.valueOf())
        return cb("assertion has expired");
    }

    cb(null, payload, assertionParams);
  });
};
jbonacci commented 11 years ago

+1 for good first bugs It sure looks like a dup until you read the error more carefully. Really interesting...

callahad commented 11 years ago

FWIW, auth_with_assertion returned that failure with an HTTP 200

callahad commented 11 years ago

Request Headers:

POST /wsapi/auth_with_assertion HTTP/1.1
Host: login.persona.org
Connection: keep-alive
Content-Length: 1899
BrowserID-Version: 0d62ce6
Accept: application/json;text/plain
Origin: https://login.persona.org
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.6 Safari/537.36
Content-type: application/json
Referer: https://login.persona.org/sign_in
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: browserid_state=X8Omw5ouGjjDrGBvw5cTAHVbwrjDkQ.NcOewpLCmQPCpMKBZMKfw4rCrMOow5FFESvCkMO3G1EuwpnCk350wrvCnyRTK8K5ZcO3VxFSwr97w5w5woULMMOnf1Zybw.1365104945333.2419200000.wrIIw6dcw4p9YcO5wpvDixYXRcOjw41yEwMABHh0wqPCkzrCumE9w4RNwoMC

Request Payload:

{"assertion":"eyJhbGciOiJSUzI1NiJ9.eyJwdWJsaWMta2V5Ijp7ImFsZ29yaXRobSI6IkRTIiwieSI6IjdmNjA4MmI3YjMxNmU3MmJmZTRiZjkzYTA0ZjBlN2RlNzMwMTMxYzE1MzdlOGVmY2FjNWU5ODI0Yzg3MDM3ZTM3YzAyOGUyZjFiMDQ3OTE1ZjY0ODUyMzViMjUzNWMzYjlmNmU0OTIyNWQ1Y2M3NGFmMzlkOGQxMjJmZWM2MGRmZDJhZThmYmRhMDA5YjE4YzNiNWEwMzQ1ZWFkYWE0ODc5MWM4Nzc4ODIzMzA2ZjIxNjM1YmFmZDYxY2MwM2U1OWY5ZTdhYzUyOTE4NjI2MGY1OGI5OWNlZmRkZDAwNTEwMGM0ZDBlY2JmODAxMmYxZTQ0MWRmM2JjYjRlYTY4ODAiLCJwIjoiZmY2MDA0ODNkYjZhYmZjNWI0NWVhYjc4NTk0YjM1MzNkNTUwZDlmMWJmMmE5OTJhN2E4ZGFhNmRjMzRmODA0NWFkNGU2ZTBjNDI5ZDMzNGVlZWFhZWZkN2UyM2Q0ODEwYmUwMGU0Y2MxNDkyY2JhMzI1YmE4MWZmMmQ1YTViMzA1YThkMTdlYjNiZjRhMDZhMzQ5ZDM5MmUwMGQzMjk3NDRhNTE3OTM4MDM0NGU4MmExOGM0NzkzMzQzOGY4OTFlMjJhZWVmODEyZDY5YzhmNzVlMzI2Y2I3MGVhMDAwYzNmNzc2ZGZkYmQ2MDQ2MzhjMmVmNzE3ZmMyNmQwMmUxNyIsInEiOiJlMjFlMDRmOTExZDFlZDc5OTEwMDhlY2FhYjNiZjc3NTk4NDMwOWMzIiwiZyI6ImM1MmE0YTBmZjNiN2U2MWZkZjE4NjdjZTg0MTM4MzY5YTYxNTRmNGFmYTkyOTY2ZTNjODI3ZTI1Y2ZhNmNmNTA4YjkwZTVkZTQxOWUxMzM3ZTA3YTJlOWUyYTNjZDVkZWE3MDRkMTc1ZjhlYmY2YWYzOTdkNjllMTEwYjk2YWZiMTdjN2EwMzI1OTMyOWU0ODI5YjBkMDNiYmM3ODk2YjE1YjRhZGU1M2UxMzA4NThjYzM0ZDk2MjY5YWE4OTA0MWY0MDkxMzZjNzI0MmEzODg5NWM5ZDViY2NhZDRmMzg5YWYxZDdhNGJkMTM5OGJkMDcyZGZmYTg5NjIzMzM5N2EifSwicHJpbmNpcGFsIjp7ImVtYWlsIjoiZHBjbW96cGlwZXNAeWFob28uY29tIn0sImlhdCI6MTM2NTEwNDk2NDg3NiwiZXhwIjoxMzY1MTA1Mjc0ODc2LCJpc3MiOiJ5YWhvby5sb2dpbi5wZXJzb25hLm9yZyJ9.fTfXmMoGQBhoU8R2bQLTITQB_Ir6iQGKqpmRrrJiWzPkEekbQ9Ins6_-SPMYXKDEnYpvbv9WQgbSt4LODptMagFokNLgBgNN-pYloQ0e8T6xF-pxCqtZOm09t1Gl9rcKIFDZxWWi0stDXcYdtBQDgWh2VRskPMjyfZVPWhFT-MCIJgNtlU944SsA0z9r7-vRK_C0rzoYQUsgbZsEazsYOmwK0NNOv_crx3P7QN9a7pae-RHPCOJOqtcE8gGMaUQveaoi4qAqKVHXSLbPRoKgQsVITVbBXkda5GexRQ7E5uukcpInP59-Syt5O3k9PrZYIC_UaOWMgkolTvhH4NvVQw~eyJhbGciOiJEUzEyOCJ9.eyJleHAiOjEzNjUxMDUwNzc5NjksImF1ZCI6Imh0dHBzOi8vbG9naW4ucGVyc29uYS5vcmcifQ.bGAz_gDzew466ACDWxz-fg-Z4dq9bdXsYdcyGba1c7ziklSHeXuE9Q","ephemeral":false,"csrf":"/cJM1t8XkwhhcY2eYg6Q3g=="}

Response Headers:

HTTP/1.1 200 OK
Cache-Control: no-cache, max-age=0
Content-Type: application/json; charset=utf-8
Strict-Transport-Security: max-age=10886400; includeSubdomains
Date: Thu, 04 Apr 2013 19:49:18 GMT
Connection: keep-alive
X-Frame-Options: DENY
Content-Length: 74

Response Body:

{"success":false,"reason":"assertion issued later than verification date"}
callahad commented 11 years ago

Solution: slop has to be larger than the largest possible divergence from webhead hosts we run.

callahad commented 11 years ago

This is a beta 2 blocker, and requires a hotfix to give verification some room for slop.

Looking into alternative ways to solve this.

callahad commented 11 years ago

The six bigtent servers had slightly diverging clocks, causing intermittent failures for Yahoo users.

We're confident that relative skew between bigtent instances causes this issue to manifest. As an immediate mitigation strategy, Gene changed the security groups for ntpd so that it can talk to the broader internet, which should keep things in much tighter sync.

We still need to update jwcrypto to allow configurable skew, but this is no longer a blocking issue.

shane-tomlinson commented 11 years ago

@callahad - I had thought that @ozten updated jwcrypto to handle clock skew up to two minutes in the future. If he has, has that version of jwcrypto been included into the list of browserid dependencies, or are we still using the old version? Were his changes limited to certificates, or did they cover assertions as well?

callahad commented 11 years ago

@shane-tomlinson Honestly, I don't know. I thought we were handling that too, but apparently not in the currently-deployed code. Everyone was more or less in full-blown "how do we fix this without a new deployment" mode. We haven't yet done a full analysis of how/why this broke.

ozten commented 11 years ago

@callahad - I had thought that @ozten updated jwcrypto to handle clock skew up to two minutes in the future. If he has, has that version of jwcrypto been included into the list of browserid dependencies, or are we still using the old version? Were his changes limited to certificates, or did they cover assertions as well?

Yes, the short term fix was in the certifier. It added a 10 second buffer into the future to allow for clock skew.

ozten commented 11 years ago

We're confident that relative skew between bigtent instances causes this issue to manifest. As an immediate mitigation strategy, Gene changed the security groups for ntpd so that it can talk to the broader internet, which should keep things in much tighter sync.

This came up during our QA pass. @gene1wood had implemented ntpd on stage. Servers should not be able to skew more than 10 seconds. Did we not apply this same thing to prod? Is 10 seconds still a good number? If we missed prod, I think we can close this as fixed.

callahad commented 11 years ago

Great! That explains it. One of the servers was 14 seconds in the future when we caught the issue, which would not have been mitigated by the current short term fix.

Sounds like the next step is to adjust jwcrypto to allow configurable clock skew allowances.

ozten commented 11 years ago

Let's have Gene weight in, as we already had "fixed this".

callahad commented 11 years ago

@ozten ...But we hadn't fixed it, had we? We fixed it for within 10 seconds, but the servers were more than 10 seconds apart.

ozten commented 11 years ago

Our contract with Ops was that we could keep all our infra as well as AWS machines within 10 seconds via NTP.

gene1wood commented 11 years ago

What happened with the AWS bigtent machines is that i'd enabled ntp and thought I'd fixed the problem but due to a security group block of outbound port 123/udp, ntpd couldn't talk to any time sources and so the machines began to drift again. I've now fixed the access issue at mozilla/identity-ops@d6d19a1a7d5ec83969c514405bfef99feb380444 and confirmed that all of our bigtent servers are in time sync and that all ntpd daemons are actively talking to external time sources. In the future we'll want to establish internal time sources, close that port 123/udp opening and fetch time internally.

ozten commented 11 years ago

Do you agree that I should close this bug, as 10 seconds is ample clock skew buffer?

gene1wood commented 11 years ago

Yes, with working ntpd on our bigtent servers 10 seconds is plenty. The scenario that Lloyd and Dan were thinking through involved a sub 1 second skew requirement on the webheads. I'm not sure what the final thoughts on that were however since the cause ended up being the > 10 second skew on bigtent (not < 1 second skew on webheads)

callahad commented 11 years ago

If @lloyd's hypothesis is correct regarding webhead skew, we still have this issue, we're just hacking around it by backdating assertions...

ozten commented 11 years ago

As noted in the original bugs, the fix in production is a short term fix.