project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance test: May 21st, 2024 - Blue/TST: ML ver11.0.3 downgraded #150

Closed: jffcamp closed this issue 1 week ago

jffcamp commented 1 month ago

Primary objective: We are running this performance test to replicate the good results of last year's performance test (Scenario J).

Differences since last test: Three performance tests were executed under #132, and none reproduced Scenario J's outcome. This test is expected to be equivalent to test no. 2 below, but run in Blue and after restarting the EC2 instances. Blue also has ML 11.0.3 with remnants of a nightly build of ML 11.2.0. The three #132 tests were:

  1. TST environment with Green backend resolving name search criteria against primary and alternative names. ML 11.0.3 with remnants of a nightly build of ML 11.2.0.
  2. TST environment with Green backend resolving name search criteria against primary names alone. ML 11.0.3 with remnants of a nightly build of ML 11.2.0.
  3. DEV environment with clean install of ML 11.2.0 GA.

Environment and versions: Blue (as TST), comprising MarkLogic 11.0.3 (downgraded from an 11.2 early release), Backend v1.16.0, Middle Tier v1.1.9, Frontend v1.26, and a dataset produced on 2024-04-18.

Scenario AI of the Perf Test Line Up: our existing dual app server configuration (Scenario J), but after replacing the EC2 instances, on an ML 11.0.3 environment with remnants of a nightly build of ML 11.2.0.

Key metrics we're targeting (column E / scenario J):

Number of application servers: 2 per node. Maximum number of concurrent application server threads:
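
Not part of the original write-up: one hedged way to confirm the configured thread count per app server is MarkLogic's Management API on port 8002. The server name, group name, and credentials below are placeholders, not the actual LUX configuration:

curl --anyauth -u admin:password \
  "http://localhost:8002/manage/v2/servers/lux-app-server/properties?group-id=Default&format=json"
# Look for the "threads" property in the JSON response; per-node request capacity is
# roughly (threads per app server) x (number of app servers per node).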

For more information please see the documentation: LUX Performance Testing Procedure

Tasks to complete:

- Data collection (details from the procedure)
- Revert all configuration changes
- Verify
- Analysis

jffcamp commented 1 month ago

One of Brent's grep commands, for filtering Info and Debug entries out of the MarkLogic ErrorLog:

grep -v Info ErrorLog.txt | grep -v Debug
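
Not one of Brent's commands, but a possible companion under the same idea: tally whatever remains after the Info/Debug filter by log level (this assumes the usual MarkLogic ErrorLog layout of date, time, then level):

grep -v Info ErrorLog.txt | grep -v Debug | awk '{ print $3 }' | sort | uniq -c | sort -rn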

clarkepeterf commented 1 month ago

Middle tier stats: 20240521-tst-middle-tier-stats.txt

jffcamp commented 1 month ago

Start time: 12:57:16 PM EST. End time: 3:06:58 PM EST.
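
That is roughly a 2-hour-10-minute window; a quick check with GNU date (times copied from above):

start=$(date -d "12:57:16" +%s)
end=$(date -d "15:06:58" +%s)
echo "$(( (end - start) / 60 )) minutes"   # about 129 minutes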

xinjianguo commented 1 month ago

On all 3 ML nodes, collect OS metrics:

cd; cd Apps/LUX/ML
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.156.217

nohup sudo sar -u -r -o /tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S").out 10 >/tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S")_screen.out 2>&1 &

# repeat the sar command on the other two nodes as well
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.157.111
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.254.22

cd /tmp
ls -l sar*05-21*
sudo gzip sar*05-21*
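
Not part of the steps above, but for completeness: the -o flag writes sar's samples (CPU via -u, memory via -r, every 10 seconds) to a binary file that can be replayed as text later; the file names below are placeholders:

sar -u -r -f /tmp/sar_<hostname>_<timestamp>.out          # human-readable replay
sadf -d /tmp/sar_<hostname>_<timestamp>.out -- -u -r      # delimited output for spreadsheets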

On the local desktop:

cd ~/Apps/LUX/marklogic/scripts/logAnalysis
mkdir ~/Apps/LUX/ML/test/20240521
vi collectBackendLogs.sh
./collectBackendLogs.sh
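
The contents of collectBackendLogs.sh are not shown in this thread; purely as a hypothetical sketch (hosts, key, and log path below are assumptions, not the actual script), a collector of this kind might do something like:

#!/bin/bash
# Hypothetical sketch only -- not the actual collectBackendLogs.sh
KEY=ch-lux-ssh-prod.pem
DEST="$HOME/Apps/LUX/ML/test/20240521"
for host in 10.5.156.217 10.5.157.111 10.5.254.22; do
  # assumes the default MarkLogic log location on each node
  scp -i "$KEY" "ec2-user@${host}:/var/opt/MarkLogic/Logs/ErrorLog.txt" \
      "${DEST}/ErrorLog_${host}.txt"
done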

xinjianguo commented 1 month ago

OS metrics:

sar_ip-10-5-156-62.its.yale.edu_2024-05-21T165212_screen.out.gz
sar_ip-10-5-156-62.its.yale.edu_2024-05-21T165212.out.gz
sar_ip-10-5-157-203.its.yale.edu_2024-05-21T165215_screen.out.gz
sar_ip-10-5-157-203.its.yale.edu_2024-05-21T165215.out.gz
sar_ip-10-5-254-44.its.yale.edu_2024-05-21T165217_screen.out.gz
sar_ip-10-5-254-44.its.yale.edu_2024-05-21T165217.out.gz

xinjianguo commented 1 month ago

ML CPU details: [2 screenshots]

ML Memory details: [4 screenshots]

xinjianguo commented 1 month ago

OS CPU: [1 screenshot]

ALB: [4 screenshots]

brent-hartwig commented 1 month ago

It's unknown why the request counts weren't closer across the nodes. Note that the node with the fewest requests also consistently reported higher CPU utilization.

Node   Request count   % of max
22     56,590          90%
111    59,499          95%
217    62,721          100%
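
The last column appears to be each node's request count as a share of the busiest node (217), e.g. 56,590 / 62,721 is about 90%; a quick check:

awk 'BEGIN { m=62721; printf "22: %.0f%%  111: %.0f%%  217: %.0f%%\n", 56590/m*100, 59499/m*100, 62721/m*100 }'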

@xinjianguo, the AWS CPU utilization charting is more granular than MarkLogic's. Can you map the EC2 labels from that chart to nodes 22, 111, and 217? I'd like to know whether AWS also shows node 22's CPU being utilized more than the other two nodes'. Thank you.

brent-hartwig commented 1 month ago

@xinjianguo, I don't yet see the monitoring history exports, so I'm attaching them now. I have just come to realize that the export links on the detailed views serve up different information. As such, I'm attaching what we have always exported, overview-20240521-204411.xls, plus the detailed exports:

xinjianguosccs commented 1 month ago

@brent-hartwig Oh, I thought we only needed the graphs; I'll capture the exports as well.

jffcamp commented 1 month ago

Results invalidated: a UI change caused a single flow failure, which prevented all subsequent flows from running, so QA lost 10 flows.

roamye commented 1 month ago

Approved by UAT

roamye commented 1 week ago

Closing as this ticket was marked as Done the week of 6/3.