quarkusio / search.quarkus.io

Search backend for Quarkus websites
Apache License 2.0

Slow first connection from quarkus.io website in freshly opened browser #49

Open yrodiere opened 10 months ago

yrodiere commented 10 months ago

It seems that on the very first search in a freshly opened browser, the call to the search endpoint times out... I suspect some cached operation (SSL certificate retrieval?) is way too slow on our current staging environment.

See https://quarkus-website-pr-1825-preview.surge.sh/guides/ , or hit the app directly at https://search-quarkus-io-dev-search-quarkus-io.apps.ospo-osci.z3b1.p1.openshiftapps.com/?q=orm

Also, I just plain can't reach https://quarkus-website-pr-1825-preview.surge.sh/guides/ when I'm on company VPN, so... there's that -_-

yrodiere commented 10 months ago

The main problem seems to be TLS setup.

Initial connection is slow as well, but less exceptionally so.

Firefox

TLS handling takes 700ms!

First connection: screenshot from 2023-11-16 16-05-53. Next connections: screenshot from 2023-11-16 16-06-01.

Chrome

Couldn't really reproduce; the first connection is indeed slow, but TLS handling is nowhere near as slow as what I experienced on Firefox.

First connection: screenshot from 2023-11-16 16-03-29. Next connections: screenshot from 2023-11-16 16-04-10.
mscherer commented 10 months ago

I can take a look, but can you give a few more details?

Clicking on https://search-quarkus-io-dev-search-quarkus-io.apps.ospo-osci.z3b1.p1.openshiftapps.com/?q=orm is quite fast from here (I am in the office in Paris, using Firefox). I can try again once I am at home, but I want to verify that I am not testing the wrong link or in the wrong way.

It could be that the HAProxy pair got restarted after the upgrade and there is some initial cache to fill (I'm not familiar with HAProxy, but that's a reasonable guess). As you are likely the only user of the dev application, you would have been the first one to hit that. We upgrade next on the 21st, so if it's slow again, I guess that's just it.

yrodiere commented 10 months ago

Maybe we wait until we have a production environment set up and have a look then. I wouldn't want to waste your time on a problem that turns out to be specific to this environment or to slow ADSL connections like mine...

> Clicking on https://search-quarkus-io-dev-search-quarkus-io.apps.ospo-osci.z3b1.p1.openshiftapps.com/?q=orm is quite fast from here (I am in the office in Paris, using Firefox). I can try again once I am at home, but I want to verify that I am not testing the wrong link or in the wrong way.

The thing is, the problem only occurs on the first connection that involves TLS setup; after that there's some cache or keep-alive that makes it all faster. So if you click, then open the console and reload, you won't see it.

I'd recommend using a freshly started browser, opening a blank tab, opening the dev console (F12) and then copy-pasting the URL into the address bar and pressing enter. Alternatively, wait a few minutes between attempts.

On Chrome you don't necessarily have to restart the browser; you can go to chrome://net-internals/#sockets and click "Close idle sockets" between attempts. But I couldn't really reproduce this problem on Chrome, so...
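To take the browser out of the equation entirely, the same phases (DNS lookup, TCP connect, TLS handshake) can be timed from a script. Here is a minimal sketch using only the Python standard library; the hostname is the one under discussion, and `slowest_phase` is just a small helper for reading the result:

```python
import socket
import ssl
import time


def phase_timings(host: str, port: int = 443) -> dict:
    """Measure DNS, TCP connect, and TLS handshake durations in seconds."""
    t0 = time.monotonic()
    addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
    t1 = time.monotonic()  # DNS resolution done
    sock = socket.create_connection((addr, port), timeout=10)
    t2 = time.monotonic()  # TCP connection established
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)
    t3 = time.monotonic()  # TLS handshake complete
    tls.close()
    return {"dns": t1 - t0, "tcp": t2 - t1, "tls": t3 - t2}


def slowest_phase(timings: dict) -> str:
    """Name of the phase that dominated connection setup."""
    return max(timings, key=timings.get)


# Live measurement (needs network access); uncomment to run:
# t = phase_timings("search.quarkus.io")
# for phase, seconds in t.items():
#     print(f"{phase}: {seconds * 1000:.0f} ms")
# print("slowest:", slowest_phase(t))
```

Running this repeatedly from a cold start would show whether the 700 ms TLS cost reproduces outside Firefox.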

> It could be that the HAProxy pair got restarted after the upgrade and there is some initial cache to fill (I'm not familiar with HAProxy, but that's a reasonable guess).

Might be just that, yes. Let's hope so. We had this problem a few days ago, too... would that match your upgrade schedule?

> As you are likely the only user of the dev application, you would have been the first one to hit that. We upgrade next on the 21st, so if it's slow again, I guess that's just it.

I'm at home with an ADSL connection, so admittedly everything is kind of slow and latency can get ghastly. But as you can see from my comment above, the TLS setup can get much slower than the rest. However, I just tried again, multiple times, and could not get something as slow as in my comment above. TLS handling tops out at ~200ms.

Others seem to have experienced this, though, like @michelle-purcell; I'm not sure what her connection was exactly. We suspect being connected to the company VPN was part of the issue for her, but that's just a guess (I noticed surge.sh, where the app using the REST API is hosted, wasn't reachable when on VPN, so that's weird).

yrodiere commented 8 months ago

We had to raise timeouts because network had high latency even beyond the first connection, so this is no longer really relevant.

I'll close, let's reopen if it becomes significant again.

yrodiere commented 3 weeks ago

Reopening as I saw mentions of slowness again on #320, but I don't have the details -- I think most of it was discussed privately between Marko and Guillaume, who would know more?

> As to the initial slow response times and switching to the local search as a result, it may be related to https://github.com/quarkusio/search.quarkus.io/issues/49

This led me to the following suggestion:

> Can't the initial slow response simply come from OpenSearch doing some lazy initialization? Or some classloading in the Quarkus app on first request?

Which turned out to be stupid:

> As for the slowness, I don't think it's class loading as it's on quarkus.io and I don't suppose we spawn a new app for every request.

So from what I gather, the slowness happens:

My questions would be:

  1. what are the symptoms of the "failure" exactly: no results, error message, ... ?
  2. what's the browser (see above, it might matter)?
  3. is there any way to reproduce this semi-reliably (I can try a few dozen times if needed)?
  4. does network debugging in the browser (pressing F12) show anything weird (like above)?

@gsmet when you have some time, any info would help. Thanks

gsmet commented 3 weeks ago

So the error was due to the local search being broken (that's what Marko fixed).

As for what I get with the first connection, here it is (I typed Hibern):

Screenshot from 2024-08-28 10-17-06

gsmet commented 3 weeks ago

I dunno what the blocked part is...

The DNS part + TLS setup is very slow. The DNS part could be due to a low TTL.

From what I can see, search.quarkus.io has a TTL of 1 hour but... the host behind the alias has a TTL of 60 seconds - which could be explained by the dynamic nature of OpenShift hosting. But in that case you need a very fast DNS resolver, and I wonder if it could be the issue (at least part of it).
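In other words, a resolver follows the CNAME chain and caches each record for its own TTL, so the 60-second record dictates how often clients must re-resolve, regardless of the 1-hour TTL on the alias. A toy illustration (the hostnames below are placeholders, not the real records):

```python
def effective_ttl(chain):
    """A resolver caches each record in a CNAME chain for its own TTL,
    so the cached answer as a whole must be refreshed when the
    shortest-lived record expires: the minimum TTL wins."""
    return min(ttl for _name, ttl in chain)


# Hypothetical chain mirroring the situation described above:
chain = [
    ("search.quarkus.io", 3600),       # CNAME record, 1 hour
    ("<elb-host>.amazonaws.com", 60),  # A record behind the alias, 60 s
]
print(effective_ttl(chain))  # → 60
```

So even a client that searched an hour ago pays the DNS cost again after at most a minute of idleness, unless its resolver ignores the TTL.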

As for the TLS setup, I dunno why it's so slow - whether it could be related to the TTL too, or whether it's something else entirely.

I'm wondering if we should host this thing behind CloudFlare or similar - ideally with a ping query every 50s to make sure the DNS stays in cache (but I'm not sure if this can be done).

gsmet commented 3 weeks ago

CloudFlare has a free plan that we should maybe explore. It looks sufficient for us and they tend to be very nice.

Now that means putting the entire domain (I tried with a subdomain) behind CloudFlare and handling our DNS there. Or use another domain.

yrodiere commented 3 weeks ago

> CloudFlare has a free plan that we should maybe explore. It looks sufficient for us and they tend to be very nice.

That's a nice idea, but just to clarify, that's for the DNS part only, right? It doesn't get rid of the TLS setup problem, since we need communication between Cloudflare and the app to be secure as well... Unless I misunderstand how it all works.

> Now that means putting the entire domain (I tried with a subdomain) behind CloudFlare and handling our DNS there. Or use another domain.

:shrug: fine by me, but then I'm not the one dealing with the domain setup, so I'm hardly the one to convince.

mscherer commented 3 weeks ago

We discussed a bit by mail, and before starting a possibly painful DNS move, I would suggest adding a search2.quarkus.io route and hardcoding the current IP with a TTL of a few hours instead of a CNAME. This would remove the ELB DNS resolution from the equation and allow a test without disrupting anyone.
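For illustration, the proposal might look like this in zone-file form (all names, IPs, and TTLs below are hypothetical; 203.0.113.10 is a reserved documentation address):

```text
; current: alias with a short-lived target behind it (values illustrative)
search.quarkus.io.                      3600   IN  CNAME  router.apps.example.openshiftapps.com.
router.apps.example.openshiftapps.com.  60     IN  A      203.0.113.10

; proposed test name: a direct A record with a TTL of a few hours
search2.quarkus.io.                     14400  IN  A      203.0.113.10
```

With the direct A record, a resolver can cache the full answer for hours instead of re-resolving the ELB host every minute.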

But in the end, I think the biggest problem is that the containers are running on an OpenShift cluster in Oregon.

mscherer commented 3 weeks ago

I asked my boss to test from San Francisco, and indeed, that's much better (DNS was 0 ms; I didn't dig further, assuming some local DNS cache): 32 ms for connecting, 50 ms for TLS setup, 117 ms for waiting.

So this kinda shows that the latency is caused by the cluster being in Oregon.

One idea could be to have reverse proxies (like Varnish) in datacenters closer to users (for example, one in Europe). The proxy could keep a pool of persistent connections to the backend, thus saving the DNS resolution and TLS setup time, as they would be done before the first request. And since it would be closer to some users, the round trip for the initial connection would be faster, saving ~400 ms.

yrodiere commented 3 weeks ago

> We discussed a bit by mail, and before starting a possibly painful DNS move, I would suggest adding a search2.quarkus.io route and hardcoding the current IP with a TTL of a few hours instead of a CNAME. This would remove the ELB DNS resolution from the equation and allow a test without disrupting anyone.

I just gave it a try but I failed to find information on how to configure that on the web; routes targeting IPs directly don't seem to be a common use case. Should I set up a LoadBalancer service with a loadBalancerIP? Or were you thinking of some configuration directly in the Route?

FWIW the configuration is there: https://console-openshift-console.apps.ospo-osci.z3b1.p1.openshiftapps.com/k8s/ns/prod-search-quarkus-io/route.openshift.io~v1~Route

> One idea could be to have reverse proxies (like Varnish) in datacenters closer to users (for example, one in Europe). The proxy could keep a pool of persistent connections to the backend, thus saving the DNS resolution and TLS setup time, as they would be done before the first request. And since it would be closer to some users, the round trip for the initial connection would be faster, saving ~400 ms.

Alternatively, considering this seems to affect the first request only... and this may be dumb, but hear me out... we send a dummy request (HEAD?) from the client to the search service in the background, when the quarkus.io page opens. Which should be relatively safe, since https://quarkus.io/guides is pretty much useless without search, so people ending up there will necessarily use the search service. That wouldn't solve Guillaume's use case of re-using the same page for a search a few minutes later, but at least this solution is relatively simple to implement.

yrodiere commented 3 weeks ago

> Alternatively, considering this seems to affect the first request only... and this may be dumb, but hear me out... we send a dummy request (HEAD?) from the client to the search service in the background, when the quarkus.io page opens. Which should be relatively safe, since https://quarkus.io/guides is pretty much useless without search, so people ending up there will necessarily use the search service.

Ooooh I just discovered this (yes I'm illiterate in HTML): https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel/preconnect . I'll send a PR to try to use it on quarkus.io.
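For reference, the preconnect hint is a one-line addition to the page head; something along these lines (the exact markup in the PR may differ, and the `crossorigin` attribute only matters because the search calls are cross-origin fetches):

```html
<!-- Open DNS + TCP + TLS to the search backend as soon as the page
     loads, before the first search request is made. -->
<link rel="preconnect" href="https://search.quarkus.io" crossorigin>
```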

mscherer commented 3 weeks ago

> I just gave it a try but I failed to find information on how to configure that on the web; routes targeting IPs directly don't seem to be a common use case.

I was thinking more of adding a second route in the namespace with the new vhost, plus a new DNS record that resolves directly to the IP (an A record instead of a CNAME). I do not think we can directly set up a Route object that answers on the IP (you would need an Ingress object, I think?).

yrodiere commented 3 weeks ago

> Ooooh I just discovered this (yes I'm illiterate in HTML): https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel/preconnect . I'll send a PR to try to use it on quarkus.io.

Done: https://github.com/quarkusio/quarkusio.github.io/pull/2104 It seems to help on Chrome, at least. For some reason, when testing on Firefox, it didn't seem to have any effect :shrug:

> I was thinking more of adding a second route in the namespace with the new vhost, plus a new DNS record that resolves directly to the IP (an A record instead of a CNAME). I do not think we can directly set up a Route object that answers on the IP (you would need an Ingress object, I think?).

Ok; I'll need someone else to create that DNS record because I don't have control over the DNS.

FWIW we also have a staging env where we don't use an external DNS: https://search-quarkus-io-dev-search-quarkus-io.apps.ospo-osci.z3b1.p1.openshiftapps.com/api/guides/search?q=hiber

yrodiere commented 3 weeks ago

> So the error was due to the local search being broken (that's what Marko fixed).
>
> As for what I get with the first connection, here it is (I typed Hibern):
>
> Screenshot from 2024-08-28 10-17-06

I opened #324 to try to reduce the "waiting" time.

yrodiere commented 5 days ago

> So the error was due to the local search being broken (that's what Marko fixed).
>
> As for what I get with the first connection, here it is (I typed Hibern):
>
> Screenshot from 2024-08-28 10-17-06

FWIW we made some progress on the waiting part in #324; here's where we stand right now (same search, Hiber):

(screenshot)

Notes:

  • The actual improvement I'm trying to showcase here is the "Waiting" part: 349 ms => 236 ms. Still not great, but we're in a better place.

  • When running the same search on a local instance of the app (dev mode) I get ~50 to ~80 ms of "Waiting"

  • When running the same request from within the prod container (localhost) using curl, I also get ~80ms of "Waiting", so the difference (~150ms) must be network overhead.

See also https://github.com/quarkusio/search.quarkus.io/issues/324#issuecomment-2348805450, I suspect some (all?) of the improvement actually results from OpenSearch being redeployed on a node that favors performance (either better hardware, less noise from containers on the same node, or just a node closer to the app).