project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance test: May 21st, 2024 - Blue/TST: ML ver11.0.3 downgraded #150

Closed: jffcamp closed this issue 1 week ago

jffcamp commented 1 month ago

Primary objective: We are running this performance test to replicate the good results of last year's performance test (Scenario J).

Differences since last test: Three performance tests were executed under #132, and none reproduced Scenario J's outcome. This test is expected to be equivalent to test no. 2 below, but run in Blue and after restarting the EC2 instances. Blue also has ML 11.0.3 with remnants of a nightly build of ML 11.2.0. The three #132 tests were:

  1. TST environment with Green backend resolving name search criteria against primary and alternative names. ML 11.0.3 with remnants of a nightly build of ML 11.2.0.
  2. TST environment with Green backend resolving name search criteria against primary names alone. ML 11.0.3 with remnants of a nightly build of ML 11.2.0.
  3. DEV environment with clean install of ML 11.2.0 GA.

Environment and versions: Blue (as TST), comprising MarkLogic 11.0.3 (downgraded from an 11.2 early release), Backend v1.16.0, Middle Tier v1.1.9, Frontend v1.26, and a dataset produced on 2024-04-18.

Scenario AI of the Perf Test Line Up: our existing dual app server configuration (Scenario J), but after replacing the EC2 instances, on an ML 11.0.3 environment with remnants of a nightly build of ML 11.2.0.

Key metrics we're targeting (column E / scenario J):

Number of application servers: 2 per node. Maximum number of concurrent application server threads:
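
Not part of the original write-up: one hedged way to confirm the configured thread count per app server is MarkLogic's Management API on port 8002. The server name, group name, and credentials below are placeholders, not the actual LUX configuration:

curl --anyauth -u admin:password \
  "http://localhost:8002/manage/v2/servers/lux-app-server/properties?group-id=Default&format=json"
# Look for the "threads" property in the JSON response; per-node request capacity is
# roughly (threads per app server) x (number of app servers per node).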

For more information please see the documentation: LUX Performance Testing Procedure

Tasks to complete:

- Data collection (details from the procedure)
- Revert all configuration changes
- Verify
- Analysis

jffcamp commented 1 month ago

One of Brent's grep commands, for filtering Info and Debug entries out of the MarkLogic ErrorLog:

grep -v Info ErrorLog.txt | grep -v Debug
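
Not one of Brent's commands, but a possible companion under the same idea: tally whatever remains after the Info/Debug filter by log level (this assumes the usual MarkLogic ErrorLog layout of date, time, then level):

grep -v Info ErrorLog.txt | grep -v Debug | awk '{ print $3 }' | sort | uniq -c | sort -rn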

clarkepeterf commented 1 month ago

Middle tier stats: 20240521-tst-middle-tier-stats.txt

jffcamp commented 1 month ago

Start time: 12:57:16 PM EST. End time: 3:06:58 PM EST.
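
That is roughly a 2-hour-10-minute window; a quick check with GNU date (times copied from above):

start=$(date -d "12:57:16" +%s)
end=$(date -d "15:06:58" +%s)
echo "$(( (end - start) / 60 )) minutes"   # about 129 minutes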

xinjianguo commented 1 month ago

On all 3 ML nodes, collect OS metrics:

cd; cd Apps/LUX/ML
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.156.217

nohup sudo sar -u -r -o /tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S").out 10 >/tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S")_screen.out 2>&1 &

# repeat the sar command on the other two nodes as well
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.157.111
ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.254.22

cd /tmp
ls -l sar*05-21*
sudo gzip sar*05-21*
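
Not part of the steps above, but for completeness: the -o flag writes sar's samples (CPU via -u, memory via -r, every 10 seconds) to a binary file that can be replayed as text later; the file names below are placeholders:

sar -u -r -f /tmp/sar_<hostname>_<timestamp>.out          # human-readable replay
sadf -d /tmp/sar_<hostname>_<timestamp>.out -- -u -r      # delimited output for spreadsheets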

On the local desktop:

cd ~/Apps/LUX/marklogic/scripts/logAnalysis
mkdir ~/Apps/LUX/ML/test/20240521
vi collectBackendLogs.sh
./collectBackendLogs.sh
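
The contents of collectBackendLogs.sh are not shown in this thread; purely as a hypothetical sketch (hosts, key, and log path below are assumptions, not the actual script), a collector of this kind might do something like:

#!/bin/bash
# Hypothetical sketch only -- not the actual collectBackendLogs.sh
KEY=ch-lux-ssh-prod.pem
DEST="$HOME/Apps/LUX/ML/test/20240521"
for host in 10.5.156.217 10.5.157.111 10.5.254.22; do
  # assumes the default MarkLogic log location on each node
  scp -i "$KEY" "ec2-user@${host}:/var/opt/MarkLogic/Logs/ErrorLog.txt" \
      "${DEST}/ErrorLog_${host}.txt"
done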

xinjianguo commented 1 month ago

OS metrics:

sar_ip-10-5-156-62.its.yale.edu_2024-05-21T165212_screen.out.gz
sar_ip-10-5-156-62.its.yale.edu_2024-05-21T165212.out.gz
sar_ip-10-5-157-203.its.yale.edu_2024-05-21T165215_screen.out.gz
sar_ip-10-5-157-203.its.yale.edu_2024-05-21T165215.out.gz
sar_ip-10-5-254-44.its.yale.edu_2024-05-21T165217_screen.out.gz
sar_ip-10-5-254-44.its.yale.edu_2024-05-21T165217.out.gz

xinjianguo commented 1 month ago

ML CPU details: [2 screenshots]

ML Memory details: [4 screenshots]

xinjianguo commented 1 month ago

OS CPU: [1 screenshot]

ALB: [4 screenshots]

brent-hartwig commented 1 month ago

It's unknown why the request counts weren't closer across the nodes. Note that the node with the fewest requests also consistently reported higher CPU utilization.

Node   Request count   % of max
22     56,590          90%
111    59,499          95%
217    62,721          100%
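
The last column appears to be each node's request count as a share of the busiest node (217), e.g. 56,590 / 62,721 is about 90%; a quick check:

awk 'BEGIN { m=62721; printf "22: %.0f%%  111: %.0f%%  217: %.0f%%\n", 56590/m*100, 59499/m*100, 62721/m*100 }'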

@xinjianguo, the AWS CPU utilization charting is more granular than MarkLogic's. Can you map the EC2 labels from that chart to nodes 22, 111, and 217? I'd like to know whether AWS also shows node 22's CPU being utilized more than the other two nodes'. Thank you.

brent-hartwig commented 1 month ago

@xinjianguo, I don't yet see the monitoring history exports, so I'm attaching them now. I have just come to realize that the export links on the detailed views serve up different information. As such, I'm attaching what we have always exported, overview-20240521-204411.xls, plus the detailed exports:

xinjianguosccs commented 1 month ago

@brent-hartwig Oh, I thought we only needed the graphs; I'll capture the exports as well.

jffcamp commented 1 month ago

Results invalidated: a UI change caused a single flow failure, which prevented all subsequent flows from running, so QA lost 10 flows.

roamye commented 1 month ago

Approved by UAT

roamye commented 1 week ago

Closing as this ticket was marked as Done the week of 6/3.