Hmm... Interesting. So, after running a few more tests, I'm suspicious of the Denver machines. /cc @pmeenan
Looking at GitHub in particular, a few notes:
Running the same test from a different location with IE11 yields: http://www.webpagetest.org/result/141002_5P_5G0/3/details/
There is still a gap, but looking at the tcpdump trace, I don't see anything obviously broken... except that the ~200ms smells like Windows' favorite 200ms delayed ACK (sigh): http://support.microsoft.com/kb/214397
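A rough way to check a capture for that delayed-ACK pattern is to look for inter-packet gaps close to 200ms. This is a heuristic sketch only: it assumes the per-packet timestamps (in seconds) for a single TCP connection have already been pulled out of the tcpdump trace, and the function name and the ±20ms tolerance are illustrative choices, not part of tcpdump or WebPageTest.

```python
from typing import List, Tuple

# Default Windows delayed-ACK timer, per Microsoft KB 214397.
DELAYED_ACK_MS = 200.0


def find_delayed_ack_gaps(
    timestamps_s: List[float], tolerance_ms: float = 20.0
) -> List[Tuple[int, float]]:
    """Return (index, gap_ms) for each inter-packet gap within tolerance of 200 ms.

    timestamps_s: capture times in seconds for packets of ONE TCP connection,
    in capture order (e.g. copied out of a tcpdump trace).
    index: position of the packet that arrived after the gap.
    """
    gaps: List[Tuple[int, float]] = []
    for i in range(1, len(timestamps_s)):
        gap_ms = (timestamps_s[i] - timestamps_s[i - 1]) * 1000.0
        if abs(gap_ms - DELAYED_ACK_MS) <= tolerance_ms:
            gaps.append((i, round(gap_ms, 1)))
    return gaps


if __name__ == "__main__":
    # Made-up timestamps for illustration; not taken from the trace above.
    sample = [0.000, 0.031, 0.062, 0.263, 0.290]
    print(find_delayed_ack_gaps(sample))
```

A hit is only a hint: a ~200ms pause can have other causes, so it is worth confirming in the trace that the stalled side is waiting on an ACK.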
[1] openssl s_client -connect assets-cdn.github.com:443 -tls1 -tlsextdebug -status
Firefox is the best at showing SSL timings on WPT right now because OCSP checks actually show up in the waterfall (and AFAIK they have supported stapling since 26).
Not sure it's relevant to the other requests but stapling doesn't appear to work for www.github.com: http://www.webpagetest.org/result/141002_KK_PS0/1/details/
or skipping the redirect, even for github.com: http://www.webpagetest.org/result/141002_7N_PWX/1/details/
I think the issue with IE comes back to the urs.microsoft.com request right after the base page. That is IE doing a check against the "URL Reputation Service" for its automatic phishing filter, and it looks like it blocks making any other requests until that check is complete - yikes!
Thanks, @igrigorik
I also came to the conclusion that the Denver machine is somehow flawed. However, I do have evidence that it is not the only flawed machine out there. Below is a diagram showing the 50th, 90th, and 99th percentiles of backend duration (server response time, measured from the browser) as measured by NewRelic for all of our IE11 users. September 5th is when we switched our site from HTTP to HTTPS. As you can see, the median moved as expected (by 100ms or so), but the 90th and 99th percentiles seem to show that the Denver IE behavior exists for a non-trivial number of users in the wild.
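The p50/p90/p99 comparison can be reproduced from raw samples with a nearest-rank percentile. This is a minimal sketch assuming per-page-view backend durations (in ms) can be exported for the periods before and after the switch; the helper names and sample lists are made up for illustration and are not NewRelic's API or data.

```python
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for integer p in 1..100."""
    if not values:
        raise ValueError("percentile() of an empty sequence")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) using exact integer math
    return ordered[rank - 1]


def tail_summary(durations_ms: Sequence[float]) -> Dict[str, float]:
    """p50 / p90 / p99 of a list of backend durations in milliseconds."""
    return {f"p{p}": percentile(durations_ms, p) for p in (50, 90, 99)}


if __name__ == "__main__":
    # Hypothetical samples: one list per period (HTTP vs. HTTPS).
    before_ms = [120, 130, 150, 160, 180, 200, 220, 260, 300, 900]
    after_ms = [210, 220, 240, 260, 280, 300, 340, 420, 2500, 9000]
    print("HTTP ", tail_summary(before_ms))
    print("HTTPS", tail_summary(after_ms))
```

Nearest-rank always returns an actual observed value, which makes tail outliers easy to trace back to individual page views.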
In other words, TLS is not fast yet for 10%+ of users on the web :) My theory is that there is some combination of browser version, toolbar, extension, firewall, or something else that is causing some IE browsers to be extremely slow with TLS.
@pmeenan, urs.microsoft.com is an interesting theory. I had never heard of the "URL Reputation Service" before. However, I do not see urs.microsoft.com in the tcpdump capture. It's also not clear why a urs.microsoft.com lookup for HTTPS URLs would be slower than for HTTP URLs.
Sorry, I was referring to Ilya's github waterfall where it is request #3.
stapling doesn't appear to work for github.com (in FF): http://www.webpagetest.org/result/141002_7N_PWX/1/details/
/cc @mcmanus ... any ideas what could be going wrong here?
@quarterdome as a sanity check... do you have access to full NavTiming data, and can you isolate the TLS connect times? connectEnd - secureConnectionStart should do the trick. Also, have you tried segmenting the data by geography or other variables? I'm wondering if there are other factors at play. Do you see the same tail impact on other versions of IE and on other browsers?
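A minimal sketch of that calculation plus a simple segmentation, assuming NavTiming beacons have been exported as dicts with a `timing` sub-dict (keys named like the `performance.timing` attributes) and segment fields such as `country` or `browser`. The beacon layout and helper names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Optional


def tls_time_ms(timing: Mapping[str, float]) -> Optional[float]:
    """connectEnd - secureConnectionStart, or None if no TLS start was recorded.

    secureConnectionStart is 0 when the page was not loaded over HTTPS, and
    some browsers may not expose it at all, so both cases are skipped.
    """
    start = timing.get("secureConnectionStart") or 0
    if start <= 0:
        return None
    return timing["connectEnd"] - start


def tls_times_by_segment(
    beacons: Iterable[Mapping[str, Any]], key: str
) -> Dict[str, List[float]]:
    """Group TLS handshake times by a segment field, e.g. 'country' or 'browser'."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for beacon in beacons:
        tls = tls_time_ms(beacon["timing"])
        if tls is not None:
            groups[str(beacon.get(key, "unknown"))].append(tls)
    return dict(groups)
```

Each group's list can then be summarized per segment (for example with percentiles) to see whether the slow tail concentrates anywhere.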
@mcmanus @igrigorik - if you do the openssl check on github (not the static CDN), you can see that no stapling info is included for the main domain. I don't think it's a Firefox issue.
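That check can be scripted. This sketch assumes an `openssl` binary on the PATH and relies on the strings `OCSP response: no response sent` and `OCSP Response Data:` that `s_client -status` prints; the helper names are hypothetical. It drops `-tls1` and `-tlsextdebug` from the earlier command, since newer servers and OpenSSL builds may refuse TLS 1.0, and adds `-servername` so SNI is sent.

```python
import subprocess
from typing import Optional


def stapling_status(s_client_output: str) -> Optional[bool]:
    """Interpret `openssl s_client -status` output.

    True  -> a stapled OCSP response was printed
    False -> OpenSSL reported that no OCSP response was sent
    None  -> no OCSP section found (e.g. -status was not passed)
    """
    if "OCSP response: no response sent" in s_client_output:
        return False
    if "OCSP Response Data:" in s_client_output:
        return True
    return None


def check_host(host: str, port: int = 443, timeout_s: float = 15.0) -> Optional[bool]:
    """Run s_client against host:port; needs network access and an openssl binary."""
    result = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-status"],
        input="",  # empty stdin: s_client exits once it reads EOF
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return stapling_status(result.stdout)
```

For example, `check_host("github.com")` runs the same kind of check discussed here; results today will naturally differ from those in this thread.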
@pmeenan could have sworn I checked that yesterday and it was working.. perhaps I'm hallucinating. /cc @dbussink :)
We don't have stapling on github.com at the moment, only our CDN does (assets-cdn.github.com which is served through Fastly).
@igrigorik Turns out NewRelic had a bug in their data collection agent that particularly affected IE. They fixed it yesterday. I'll wait a day to collect more data, and then try to segment and isolate slow results.
@quarterdome excellent, thanks!
@quarterdome any updates?
@igrigorik, thanks for the ping!
NewRelic fixed their bug, but unfortunately I was not able to find any segment that isolates the slow requests. I tried geolocation, browser version, device, etc. I also submitted a support ticket with NewRelic to make sure there is no measurement error here, and while they were surprised by the results, they responded that the measurements are accurate.
I am not sure where to go from here :(
@quarterdome to confirm, sounds like you're still seeing the same % latency bump for IE then? Can you segment by DNS, TCP, etc? We're debugging in the blind here :)
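Given raw Navigation Timing records, a per-phase split might look like the sketch below. It assumes the record's keys match the `performance.timing` attribute names; the function name and the `wait`/`download` labels are illustrative, not a NewRelic metric.

```python
from typing import Dict, Mapping


def phase_breakdown(t: Mapping[str, float]) -> Dict[str, float]:
    """Per-phase durations (ms) from one Navigation Timing record.

    For HTTPS pages the connectStart..connectEnd interval includes the TLS
    handshake, so it is split at secureConnectionStart when that is > 0.
    """
    secure_start = t.get("secureConnectionStart") or 0
    tcp_end = secure_start if secure_start > 0 else t["connectEnd"]
    return {
        "dns": t["domainLookupEnd"] - t["domainLookupStart"],
        "tcp": tcp_end - t["connectStart"],
        "tls": t["connectEnd"] - secure_start if secure_start > 0 else 0.0,
        # requestStart..responseStart: server time plus roughly one network round trip
        "wait": t["responseStart"] - t["requestStart"],
        "download": t["responseEnd"] - t["responseStart"],
    }


if __name__ == "__main__":
    # Hypothetical record; values in ms since navigationStart for readability.
    record = {
        "domainLookupStart": 10, "domainLookupEnd": 30,
        "connectStart": 30, "secureConnectionStart": 60, "connectEnd": 150,
        "requestStart": 150, "responseStart": 400, "responseEnd": 450,
    }
    print(phase_breakdown(record))
```

Splitting TCP from TLS this way makes it possible to tell whether a slow tail comes from the handshake itself or from something before or after it.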
I will need to use different tools to record the DNS time, TCP time, etc. NewRelic APM and Insights are not giving me that level of real user monitoring. NewRelic Browser could give me more info, but it is very new, and it will take me a few days to configure it properly to trace the right things.
Also, to answer your earlier question, I see a similar pattern for IE and Safari (but not for Chrome and Firefox). In fact, the 99th percentile on Safari is over 30 seconds, which is crazy for backend duration. These browsers have nothing in common, other than that they are the OS default browsers and probably use the OS default SSL stack (rather than their own).
@quarterdome interesting. Keep us posted, would love to get to the bottom of this.
Unfortunately, I didn't get far with it. I can't reproduce this on any test machine, and I cannot catch a trace like that in NewRelic. I am out of ideas and time, so I'm almost ready to give up :(
@quarterdome if you get to the bottom of it, let us know.
I ran into this page while looking for a solution to a performance issue we have been facing since we moved our whole site to HTTPS. Awesome page, many thanks @igrigorik!
I've been using webpagetest.org a lot for crude page load speed tests of various pages, and I noticed that for a lot of websites there are unexplained gaps in the waterfall when loading SSL pages (after the DNS lookup and before the initial connection). For example:
Github: http://www.webpagetest.org/result/140929_VN_1CEM/1/details/
Google: http://www.webpagetest.org/result/140929_HA_1CSD/
Apartment List (our site): http://www.webpagetest.org/result/140929_9M_13KR/1/details/
I don't get these gaps all the time, but I can consistently reproduce them using the WebPageTest.org "Denver, Colorado USA - IE 11 - Cable" configuration. We are also using NewRelic RUM on our site, and have evidence that a non-trivial number of users have the same issue in the wild.
At first, I thought it was an OCSP issue, but our CDN (CloudFront) uses OCSP stapling, and the weird gap is still there.
Any thoughts on why certain browsers / machines / locations have such a poor performance with regards to HTTPS?
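To spot the gap between the DNS lookup and the initial connection across many WebPageTest runs, one could scan per-request timings. This sketch assumes the request data has been exported as dicts with `url`, `dns_end` and `connect_start` offsets in ms; those field names, the "negative means not recorded" convention, and the 100ms threshold are assumptions for illustration, not a documented WebPageTest format.

```python
from typing import Any, Dict, List, Mapping


def dns_to_connect_gaps(
    requests: List[Mapping[str, Any]], threshold_ms: float = 100.0
) -> List[Dict[str, Any]]:
    """Flag requests whose TCP connect starts long after DNS resolution ended.

    Each request is assumed to carry 'url', 'dns_end' and 'connect_start'
    offsets in milliseconds from the start of the test (hypothetical layout).
    Missing or negative values are treated as "not recorded" and skipped.
    """
    flagged: List[Dict[str, Any]] = []
    for req in requests:
        dns_end = req.get("dns_end")
        connect_start = req.get("connect_start")
        if dns_end is None or connect_start is None or dns_end < 0 or connect_start < 0:
            continue
        gap_ms = connect_start - dns_end
        if gap_ms >= threshold_ms:
            flagged.append({"url": req.get("url", "?"), "gap_ms": gap_ms})
    return flagged
```

Running this over repeated runs per test location would show whether the gap really is tied to specific agents (such as the Denver IE 11 one) rather than to the sites being tested.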