I suspect we might not be giving enough CPU or memory to the OpenSearch cluster...
Here are the metrics while running the "benchmark":
BTW, the OpenSearch logs only mention "running full sweep" every 5 minutes, which is normal: https://forum.opensearch.org/t/jobsweeper-what-does-it-do-and-should-it-run-so-often/2263
We might want to enable slow logs...
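Search slow logs can be enabled per index through dynamic index settings; here is a minimal sketch, assuming the cluster is reachable without auth on localhost:9200 (the index name and thresholds are placeholders, not necessarily what we'd pick):
# Sketch: enable search slow logs on one index; index name and thresholds are placeholders.
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "index.search.slowlog.threshold.query.warn": "500ms",
    "index.search.slowlog.threshold.query.info": "100ms",
    "index.search.slowlog.threshold.fetch.warn": "200ms"
  }'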
Hmm, one of the nodes has this though:
[2024-01-17T08:40:32,294][WARN ][o.o.s.b.SearchBackpressureService] [search-backend-2] [monitor_only mode] cancelling task [104082] due to high resource consumption [heap usage exceeded [44.3mb >= 26.3mb]]
[2024-01-17T08:42:57,135][INFO ][o.o.j.s.JobSweeper ] [search-backend-2] Running full sweep
[2024-01-17T08:47:57,135][INFO ][o.o.j.s.JobSweeper ] [search-backend-2] Running full sweep
[2024-01-17T08:52:57,135][INFO ][o.o.j.s.JobSweeper ] [search-backend-2] Running full sweep
[2024-01-17T08:53:17,789][WARN ][o.o.s.b.SearchBackpressureService] [search-backend-2] [monitor_only mode] cancelling task [105614] due to high resource consumption [heap usage exceeded [224.8mb >= 158.8mb]]
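Those backpressure warnings suggest the per-node heap really is small. A quick, generic way to check heap and CPU usage per node (again assuming the cluster is reachable without auth on localhost:9200) is the _cat/nodes API:
# Show heap, RAM and CPU usage per node; these are standard _cat/nodes columns.
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,heap.percent,ram.percent,cpu'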
I just compared numbers with the app running in dev mode on my machine. We can definitely do better with a different setup:
DOCTYPE,0.111144
html,0.594007
html,0.628614
head,0.614885
title,0.541877
Using,0.289632
Hibernate,0.691145
ORM,0.690788
and,0.010526
Jakarta,0.309335
Persistence,0.687116
Quarkus,0.301538
title,0.693011
meta,0.436521
charset,0.532264
utf,0.568828
8,0.406567
meta,0.452762
name,0.916533
viewport,0.011984
content,0.335838
width,0.030592
device,0.241723
width,0.019924
initial,0.937839
scale,0.395314
1,0.512188
meta,0.462084
http,0.418967
equiv,0.598068
Content,0.332391
Security,0.358716
Policy,0.526373
content,0.304796
default,2.204182
src,0.501652
https,0.423663
dpm,0.010287
demdex,0.010522
net,0.573289
https,0.444182
analytics,0.273856
ossupstream,0.011330
org,0.351882
https,0.444152
search,0.647035
quarkus,0.331372
io,1.038557
script,0.618166
src,0.332196
self,0.439970
unsafe,0.061414
eval,0.396258
sha256,0.383358
ANpuoVzuSex6VhqpYgsG25OHWVA1I,0.009020
F6aGU04LoI,0.009851
5s,0.429688
sha256,0.361290
ipy9P,0.012367
3rZZW06mTLAR0EnXvxSNcnfSDPLDuh3kzbB1w,0.009243
js,0.606053
bizographics,0.010443
com,0.659405
https,0.410259
www,0.468549
redhat,0.047109
com,0.680510
https,0.540453
static,0.556675
redhat,0.056470
com,0.704743
assets,0.027610
adobedtm,0.007057
com,0.657336
jsonip,0.011655
com,0.711424
https,0.405607
ajax,0.046170
googleapis,0.015735
com,0.702099
https,0.418146
use,0.412289
fontawesome,0.009318
com,0.650506
https,0.449405
app,0.355214
mailjet,0.024929
com,0.669471
http,0.404864
www,0.451040
youtube,0.008561
com,0.661096
http,0.411000
www,0.450282
googleadservices,0.013112
com,0.694944
https,0.465166
googleads,0.008428
g,0.068582
doubleclick,0.008452
net,0.611512
https,0.407844
dpm,0.008046
demdex,0.008407
net,0.591119
https,0.411814
giscus,0.010866
app,0.360517
https,0.485159
analytics,0.310540
ossupstream,0.009325
org,0.349132
style,0.550314
src,0.285312
self,0.401565
https,0.406926
fonts,0.025166
googleapis,0.007839
com,0.703201
https,0.405715
use,0.362942
fontawesome,0.007828
com,0.668983
img,0.031278
src,0.341473
self,0.471215
https,0.425383
analytics,0.267479
ossupstream,0.009663
org,0.366618
media,0.608776
src,0.323472
self,0.390426
frame,0.576041
src,0.274058
https,0.407061
www,0.484173
youtube,0.008575
com,0.726081
https,0.431856
embed,0.426214
restream,0.008863
io,0.652262
https,0.528892
app,0.405835
mailjet,0.022611
com,0.636773
http,0.421257
xy0p2,0.007136
mjt,0.009207
lu,0.107531
https,0.402867
mj,0.009671
quarkus,0.296638
io,0.646709
https,0.504012
giscus,0.015314
app,0.553547
base,0.657725
I'll go ahead and deploy this in production.
Fixed by #148
In the prod environment, requests may take from half a second (bad, but usable) to more than 2 seconds (absurdly slow).
Interestingly, when we tested that environment before, the problem could only be reproduced from outside the cluster, which led us to conclude the network was just slow; but now, this happens even when sending requests from within the app container.
Reproducer
Log into a terminal on prod, then run this:
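A minimal sketch of that kind of timing loop, assuming the app serves its search endpoint at localhost:8080/api/guides/search (host, port, path and query terms are assumptions for illustration):
# Sketch of a timing loop; host, port, endpoint path and query terms are assumptions.
for term in quarkus hibernate security; do
  curl -s -o /dev/null -w "${term},%{time_total}\n" \
    "http://localhost:8080/api/guides/search?q=${term}"
done
curl's %{time_total} prints the total request time in seconds, which matches the term,seconds format of the numbers above.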
You'll get something like this within the prod container (little to no network overhead!):