igrigorik / istlsfastyet.com

Is TLS fast yet? Yes, yes it is.
https://istlsfastyet.com

Gap in page load waterfall related to SSL #52

Closed quarterdome closed 9 years ago

quarterdome commented 10 years ago

I ran into this page while looking for a solution to a performance issue we have been facing since we moved our entire site to HTTPS. Awesome page, many thanks @igrigorik!

I've been using webpagetest.org a lot to run crude page load speed tests on various pages, and I noticed that many websites show unexplained gaps in the waterfall when loading pages over SSL (after the DNS lookup and before the initial connection). For example:

GitHub: http://www.webpagetest.org/result/140929_VN_1CEM/1/details/
Google: http://www.webpagetest.org/result/140929_HA_1CSD/
Apartment List (our site): http://www.webpagetest.org/result/140929_9M_13KR/1/details/

I don't get these gaps all the time, but I can consistently reproduce them using the WebPageTest.org "Denver, Colorado USA - IE 11 - Cable" configuration. We are also using NewRelic RUM on our site, and have evidence that a non-trivial number of users have the same issue in the wild.

At first, I thought it was an OCSP issue, but our CDN (CloudFront) is using OCSP stapling, and the weird gap is still there.

Any thoughts on why certain browsers / machines / locations have such poor performance with HTTPS?

igrigorik commented 10 years ago

Hmm... Interesting. So, after running a few more tests, I'm suspicious of the Denver machines. /cc @pmeenan

Looking at GitHub in particular, a few notes:

Running same test from different location + IE11 yields: http://www.webpagetest.org/result/141002_5P_5G0/3/details/


There is still a gap, but looking at the tcpdump trace, I don't see anything obviously broken... Except, the ~200ms smells like the Win favorite 200ms ACK delay (sigh): http://support.microsoft.com/kb/214397

[1] openssl s_client -connect assets-cdn.github.com:443 -tls1 -tlsextdebug -status

pmeenan commented 10 years ago

Firefox is the best at showing SSL timings on WPT right now because OCSP checks actually show up in the waterfall (and AFAIK they have supported stapling since 26).

Not sure it's relevant to the other requests but stapling doesn't appear to work for www.github.com: http://www.webpagetest.org/result/141002_KK_PS0/1/details/

or skipping the redirect, even for github.com: http://www.webpagetest.org/result/141002_7N_PWX/1/details/

I think the issue with IE comes back to the urs.microsoft.com request right after the base page. That is IE doing a check against the "URL Reputation Service" for its automatic phishing filter, and it looks like it blocks making any other requests until that check is complete - yikes!

quarterdome commented 10 years ago

Thanks, @igrigorik

I also came to the conclusion that the Denver machine is somehow flawed. However, I do have evidence that it is not the only flawed machine out there. Below is a diagram showing the 50th, 90th, and 99th percentiles of the backend duration (server response time, measured from the browser) measured by NewRelic for all of our IE11 users. September 5th is when we switched our site from HTTP to HTTPS. As you can see, the median moved as expected (100ms or so); however, the 90th and 99th percentiles seem to show that the Denver IE behavior exists for a non-trivial number of users in the wild.

[Chart: backend duration]

In other words, TLS is not fast yet for 10%+ of users on the web :) My theory is that there is some combination of browser version, toolbar, extension, firewall, or something else that is causing some IE browsers to be extremely slow with TLS.
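
To make the percentile comparison above concrete, here is a minimal sketch (not NewRelic's method) of a nearest-rank percentile over backend durations. The `percentile` helper and the sample values are hypothetical, chosen only to show how one slow request can leave the median untouched while dominating the 99th percentile.

```typescript
// Hypothetical helper: nearest-rank percentile of a list of durations (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("no samples");
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = samples.slice().sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up backend durations: nine ordinary requests and one very slow one.
const durationsMs = [200, 210, 190, 220, 205, 195, 215, 225, 230, 5000];

const p50 = percentile(durationsMs, 50);
const p90 = percentile(durationsMs, 90);
const p99 = percentile(durationsMs, 99);
console.log({ p50, p90, p99 });
```

With real RUM data the same calculation would be run separately for the periods before and after the HTTPS switch, so that a shift in the tail can be told apart from a shift in the median.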

@pmeenan, urs.microsoft.com is an interesting theory. I had never heard of the "URL Reputation Service" before. I do not see urs.microsoft.com in the tcpdump capture. It is also not clear why a urs.microsoft.com lookup for HTTPS URLs would be slower than for HTTP URLs.

pmeenan commented 10 years ago

Sorry, I was referring to Ilya's github waterfall where it is request #3.

igrigorik commented 10 years ago

stapling doesn't appear to work for github.com (in FF): http://www.webpagetest.org/result/141002_7N_PWX/1/details/

/cc @mcmanus ... any ideas what could be going wrong here?

@quarterdome wrote: "However, I do have evidence that it is not the only flawed machine out there. Below is a diagram showing the 50th, 90th, and 99th percentiles of the backend duration (server response time, measured from the browser) measured by NewRelic for all of our IE11 users. September 5th is when we switched our site from HTTP to HTTPS. As you can see, the median moved as expected (100ms or so); however, the 90th and 99th percentiles seem to show that the Denver IE behavior exists for a non-trivial number of users in the wild."

@quarterdome as a sanity check.. do you have access to full NavTiming data, and can you isolate the TLS connect times? connectEnd - secureConnectionStart should do the trick. Also, have you tried segmenting data by geography or other variables? I'm wondering if there are other factors at play. Do you see same tail impact on other versions of IE + other browsers?
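
As a rough sketch of the `connectEnd - secureConnectionStart` calculation mentioned above, the snippet below splits connection setup into DNS, TCP, and TLS time. The `ConnectionTiming` interface and the `connectionPhases` helper are hypothetical names; only the field names come from the Navigation Timing API. Treating a missing or zero `secureConnectionStart` as "no TLS data" is an assumption of this sketch.

```typescript
// Subset of Navigation Timing fields used here (all in milliseconds).
interface ConnectionTiming {
  domainLookupStart: number;
  domainLookupEnd: number;
  connectStart: number;
  connectEnd: number;
  // May be 0 or undefined when no TLS handshake was recorded.
  secureConnectionStart?: number;
}

interface PhaseBreakdown {
  dns: number;
  tcp: number;
  tls: number;
}

// Hypothetical helper: splits connection setup into DNS, TCP, and TLS time.
function connectionPhases(t: ConnectionTiming): PhaseBreakdown {
  const tlsStart = t.secureConnectionStart || 0;
  const hasTls = tlsStart > 0;
  return {
    dns: t.domainLookupEnd - t.domainLookupStart,
    // Without TLS data, the whole connect interval is attributed to TCP.
    tcp: hasTls ? tlsStart - t.connectStart : t.connectEnd - t.connectStart,
    // TLS handshake time: connectEnd - secureConnectionStart.
    tls: hasTls ? t.connectEnd - tlsStart : 0,
  };
}

// Made-up timings for one HTTPS page load.
console.log(
  connectionPhases({
    domainLookupStart: 10,
    domainLookupEnd: 30,
    connectStart: 30,
    connectEnd: 330,
    secureConnectionStart: 80,
  })
);
```

In a browser, the input could come from `performance.timing` or from `performance.getEntriesByType("navigation")[0]`. Collecting the three phases per page view would also allow segmenting the slow tail by DNS, TCP, and TLS.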

pmeenan commented 10 years ago

@mcmanus @igrigorik - if you do the openssl check on github (not the static cdn) you can see that no stapling info is included for the main domain. Don't think it's a Firefox issue.
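
For anyone scripting the openssl check described above, here is a minimal sketch that inspects captured `openssl s_client ... -status` output for a stapled OCSP response. The `hasStapledOcsp` helper is hypothetical, and the matched strings are an assumption based on what openssl commonly prints; verify them against the output of your own openssl version.

```typescript
// Hypothetical helper: returns true when captured `openssl s_client -status`
// output appears to contain a successful stapled OCSP response.
function hasStapledOcsp(sClientOutput: string): boolean {
  // Assumed marker printed when the server sent no stapled response.
  if (/OCSP response:\s*no response sent/i.test(sClientOutput)) {
    return false;
  }
  // Assumed marker printed when a stapled response was received and parsed.
  return /OCSP Response Status:\s*successful/i.test(sClientOutput);
}

// Abbreviated, made-up excerpts of the two cases.
const stapledExcerpt = [
  "OCSP response:",
  "======================================",
  "OCSP Response Data:",
  "    OCSP Response Status: successful (0x0)",
].join("\n");
const unstapledExcerpt = "OCSP response: no response sent";

console.log(hasStapledOcsp(stapledExcerpt), hasStapledOcsp(unstapledExcerpt));
```

The input would be the text produced by a command like the one quoted earlier in the thread, run once against the CDN host and once against the main domain.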

igrigorik commented 10 years ago

@pmeenan could have sworn I checked that yesterday and it was working.. perhaps I'm hallucinating. /cc @dbussink :)

dbussink commented 10 years ago

We don't have stapling on github.com at the moment, only our CDN does (assets-cdn.github.com which is served through Fastly).

quarterdome commented 10 years ago

@igrigorik Turns out NewRelic had a bug in their data collection agent that particularly affected IE. They fixed it yesterday. I'll wait a day to collect more data, and then try to segment and isolate slow results.

igrigorik commented 10 years ago

@quarterdome excellent, thanks!

igrigorik commented 10 years ago

@quarterdome any updates?

quarterdome commented 10 years ago

@igrigorik, thanks for the ping!

NewRelic fixed their bug, but unfortunately I was not able to find any segment that isolates the slow requests. I tried geolocation, browser version, device, etc. I also submitted a support ticket with NewRelic to make sure there is no measurement error here, and while they were surprised by the results, they responded that the measurements are accurate.

I am not sure where to go from here :(

igrigorik commented 10 years ago

@quarterdome to confirm, sounds like you're still seeing the same % latency bump for IE then? Can you segment by DNS, TCP, etc? We're debugging in the blind here :)

quarterdome commented 10 years ago

I will need to use different tools to record the DNS time, TCP time, etc. NewRelic APM and Insights are not giving me that level of real user monitoring. NewRelic Browser could give me more info, but it is very new and it will take me a few days to configure it properly to trace the right things.

Also, to answer your earlier question, I see a similar pattern for IE and Safari (but not for Chrome and Firefox). In fact, the 99th percentile on Safari is over 30 seconds, which is crazy for backend duration. These browsers have nothing in common, other than that they are OS default browsers and are probably using the OS default SSL stack (rather than their own).

igrigorik commented 10 years ago

@quarterdome interesting. Keep us posted, would love to get to the bottom of this.

quarterdome commented 10 years ago

Unfortunately, I didn't get far with it. I can't reproduce this on any test machines, and cannot catch a trace like that in NewRelic. I am out of ideas and time, so I am almost ready to give up :(

igrigorik commented 9 years ago

@quarterdome if you get to the bottom of it, let us know.