quarkusio / search.quarkus.io

Search backend for Quarkus websites
Apache License 2.0

Outrageous variance of latency in prod #146

Closed. yrodiere closed this issue 10 months ago

yrodiere commented 10 months ago

In the prod environment, requests may take from half a second (bad, but usable) to more than 2 seconds (absurdly slow).

Interestingly, when we tested that environment before, the problem could only be reproduced from outside the cluster, which led us to conclude the network was just slow; but now, this happens even when sending requests from within the app container.

Reproducer

Log into a terminal on prod, then run this:

# Print "<word>,<total request time>" for one search request against the given host.
function query() { curl -sS -o /dev/null -w "$2,%{time_total}\n" "$1/api/guides/search?q=$2"; }
export -f query
# Fetch the Hibernate ORM guide, split it into words, and run one search query per word.
function benchmark() {
  curl -s https://quarkus.io/guides/hibernate-orm \
    | tr -c '[:alnum:]' ' ' \
    | sed -E 's/\s+/\n/g' \
    | grep -v '^$' \
    | xargs -I '{}' bash -c 'query "$@"' _ "$1" '{}'
}
benchmark localhost:8080

You'll get something like this within the prod container (little to no network overhead!):

DOCTYPE,0.109458
html,0.730111
html,2.202754
head,1.154456
title,0.841987
Using,1.175135
Hibernate,1.077278
ORM,2.202671
and,0.005574
Jakarta,0.792851
Persistence,0.899881
Quarkus,0.402898
title,0.956753
meta,2.203943
charset,2.202902
utf,2.202180
8,0.334238
meta,0.702194
name,2.202897
viewport,0.004739
content,0.282583
width,0.013304
device,0.671223
width,0.013273
initial,2.202420
scale,1.154155
1,0.477228
meta,0.426395
http,0.922965
equiv,2.203212
Content,0.362585
Security,0.344922
Policy,2.202977
content,0.468521
default,2.762873
src,0.279919
https,0.649843
dpm,0.004572
demdex,0.011351
net,2.212220
https,0.424428
analytics,0.974756
ossupstream,0.004484
org,0.899732
https,0.918954
search,0.897728
quarkus,0.599171
io,1.165841
script,2.203324
src,0.295588
self,1.020111
unsafe,0.077396
eval,0.856906
sha256,2.202551
ANpuoVzuSex6VhqpYgsG25OHWVA1I,0.065386
F6aGU04LoI,0.015847
5s,2.206016
sha256,0.967667
ipy9P,0.004608
3rZZW06mTLAR0EnXvxSNcnfSDPLDuh3kzbB1w,0.004902
js,0.943815
bizographics,0.004806
com,2.202516
https,0.668666
www,0.950869
redhat,0.037909
com,1.147529
https,0.678858
static,0.472487
redhat,0.048886
com,2.202363
assets,0.014118
adobedtm,0.004497
com,2.202495
jsonip,0.015979
com,2.203200
https,2.203308
ajax,0.075323
googleapis,0.005189
com,2.277368
https,0.634071
use,0.327698
fontawesome,0.004703
com,2.207481
https,2.203786
app,2.201994
mailjet,0.025175
com,2.203270
http,0.698329
www,0.739247
youtube,0.004310
com,2.323479
http,0.726036
www,0.856315
googleadservices,0.004128
com,2.202558
https,2.202691
googleads,0.033978
g,0.070877
doubleclick,0.007469
net,2.203135
https,2.203513
dpm,0.005090
demdex,0.003698
net,1.140691
https,2.203119
giscus,0.005854
app,0.707628
https,2.202671
analytics,0.445782
ossupstream,0.004443
org,0.706696
style,1.038720
src,0.360500
self,2.203311
https,0.720581
fonts,0.031133
googleapis,0.025645
com,1.134066
https,1.202885
use,0.362669
fontawesome,0.007371
com,2.207774
img,0.021032
src,0.667094
self,0.683208
https,0.909267
analytics,0.413196
ossupstream,0.004325
org,0.475986
media,1.076396
src,0.274180
self,0.894732
frame,2.202541
src,0.576205
https,2.202200
www,2.203063
youtube,0.006180
com,1.090934
yrodiere commented 10 months ago

I suspect we might not be giving enough CPU or memory to the OpenSearch cluster...

Here are the metrics while running the "benchmark":

[image: cluster metrics captured while running the benchmark]
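
For a quick check that does not depend on the dashboard above, the _cat/nodes API reports per-node heap and CPU pressure; the host and port below are placeholders for wherever the OpenSearch service is reachable:

# Per-node heap and CPU usage (host/port are placeholders, not the actual prod endpoint).
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,cpu,load_1m'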

yrodiere commented 10 months ago

BTW the OpenSearch logs only mention "Running full sweep" every 5 minutes, which is normal: https://forum.opensearch.org/t/jobsweeper-what-does-it-do-and-should-it-run-so-often/2263

We might want to enable slow logs...
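
A rough sketch of what enabling them could look like; the thresholds and host/port are placeholders, and `_settings` without an index name applies the thresholds to every index:

# Enable search slow logs on all indices (thresholds and endpoint are illustrative only).
curl -X PUT 'http://localhost:9200/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "500ms",
  "index.search.slowlog.threshold.query.info": "200ms",
  "index.search.slowlog.threshold.fetch.warn": "200ms"
}'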

yrodiere commented 10 months ago

Hmm, one of the nodes has this though:

[2024-01-17T08:40:32,294][WARN ][o.o.s.b.SearchBackpressureService] [search-backend-2] [monitor_only mode] cancelling task [104082] due to high resource consumption [heap usage exceeded [44.3mb >= 26.3mb]]
[2024-01-17T08:42:57,135][INFO ][o.o.j.s.JobSweeper       ] [search-backend-2] Running full sweep
[2024-01-17T08:47:57,135][INFO ][o.o.j.s.JobSweeper       ] [search-backend-2] Running full sweep
[2024-01-17T08:52:57,135][INFO ][o.o.j.s.JobSweeper       ] [search-backend-2] Running full sweep
[2024-01-17T08:53:17,789][WARN ][o.o.s.b.SearchBackpressureService] [search-backend-2] [monitor_only mode] cancelling task [105614] due to high resource consumption [heap usage exceeded [224.8mb >= 158.8mb]]
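
The repeated heap-related cancellations suggest the nodes are tight on JVM heap. If the heap really is small, giving the nodes a larger, fixed-size heap is one knob to try; a minimal sketch, assuming the nodes use the official image where OPENSEARCH_JAVA_OPTS sizes the JVM heap (the values are illustrative, not the actual prod settings):

# Give each node a larger, fixed-size heap via the container environment (illustrative values).
export OPENSEARCH_JAVA_OPTS="-Xms1g -Xmx1g"
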
yrodiere commented 10 months ago

I just compared numbers with the app running in dev mode on my machine. We can definitely do better with a different setup:

[image: latency numbers with the app running in dev mode locally]

yrodiere commented 10 months ago

#147 seems to have radically improved the numbers on staging:

DOCTYPE,0.111144
html,0.594007
html,0.628614
head,0.614885
title,0.541877
Using,0.289632
Hibernate,0.691145
ORM,0.690788
and,0.010526
Jakarta,0.309335
Persistence,0.687116
Quarkus,0.301538
title,0.693011
meta,0.436521
charset,0.532264
utf,0.568828
8,0.406567
meta,0.452762
name,0.916533
viewport,0.011984
content,0.335838
width,0.030592
device,0.241723
width,0.019924
initial,0.937839
scale,0.395314
1,0.512188
meta,0.462084
http,0.418967
equiv,0.598068
Content,0.332391
Security,0.358716
Policy,0.526373
content,0.304796
default,2.204182
src,0.501652
https,0.423663
dpm,0.010287
demdex,0.010522
net,0.573289
https,0.444182
analytics,0.273856
ossupstream,0.011330
org,0.351882
https,0.444152
search,0.647035
quarkus,0.331372
io,1.038557
script,0.618166
src,0.332196
self,0.439970
unsafe,0.061414
eval,0.396258
sha256,0.383358
ANpuoVzuSex6VhqpYgsG25OHWVA1I,0.009020
F6aGU04LoI,0.009851
5s,0.429688
sha256,0.361290
ipy9P,0.012367
3rZZW06mTLAR0EnXvxSNcnfSDPLDuh3kzbB1w,0.009243
js,0.606053
bizographics,0.010443
com,0.659405
https,0.410259
www,0.468549
redhat,0.047109
com,0.680510
https,0.540453
static,0.556675
redhat,0.056470
com,0.704743
assets,0.027610
adobedtm,0.007057
com,0.657336
jsonip,0.011655
com,0.711424
https,0.405607
ajax,0.046170
googleapis,0.015735
com,0.702099
https,0.418146
use,0.412289
fontawesome,0.009318
com,0.650506
https,0.449405
app,0.355214
mailjet,0.024929
com,0.669471
http,0.404864
www,0.451040
youtube,0.008561
com,0.661096
http,0.411000
www,0.450282
googleadservices,0.013112
com,0.694944
https,0.465166
googleads,0.008428
g,0.068582
doubleclick,0.008452
net,0.611512
https,0.407844
dpm,0.008046
demdex,0.008407
net,0.591119
https,0.411814
giscus,0.010866
app,0.360517
https,0.485159
analytics,0.310540
ossupstream,0.009325
org,0.349132
style,0.550314
src,0.285312
self,0.401565
https,0.406926
fonts,0.025166
googleapis,0.007839
com,0.703201
https,0.405715
use,0.362942
fontawesome,0.007828
com,0.668983
img,0.031278
src,0.341473
self,0.471215
https,0.425383
analytics,0.267479
ossupstream,0.009663
org,0.366618
media,0.608776
src,0.323472
self,0.390426
frame,0.576041
src,0.274058
https,0.407061
www,0.484173
youtube,0.008575
com,0.726081
https,0.431856
embed,0.426214
restream,0.008863
io,0.652262
https,0.528892
app,0.405835
mailjet,0.022611
com,0.636773
http,0.421257
xy0p2,0.007136
mjt,0.009207
lu,0.107531
https,0.402867
mj,0.009671
quarkus,0.296638
io,0.646709
https,0.504012
giscus,0.015314
app,0.553547
base,0.657725

I'll go ahead and deploy this in production.

yrodiere commented 10 months ago

Fixed by #148