project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance Test - scheduled for 2024-07-17 #226

Closed jffcamp closed 1 month ago

jffcamp commented 1 month ago

This was the first performance test executed by NeoLoad.

Primary Objective

We will be following scenario AM, which is similar to scenario J but with the Advanced Search Configuration and Data Constants requests moved from Group-1 to Group-2; this was done to minimize errors during the performance test.

The purpose of this test is to validate that ML 11.3 performs sufficiently to be moved to production with the 7/29 Blue/Green switch.

Changes Being Tested

We are primarily testing an upgrade to ML 11.3.0 GA, with the 2024-05-29 dataset.

Other changes compared to both previous performance tests:

Context

Environment and Versions

Backend Application Server Configuration

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

Collect Data

Restore and Verify Environment

Analyze

brent-hartwig commented 1 month ago

ML Monitoring History

Time period: 19:50 - 20:20 UTC (last test of the day with aggressive ramp up)

CPU:

01-cpu

File IO Detail:

02-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

intra-1-of-2

Intra-cluster activity, 2 of 2:

intra-2-of-2

Data node characteristics for the lux-content database alone:

07-database

Exports:

memory-detail-20240718-175928.xls
network-detail-20240718-180345.xls
servers-detail-20240718-180623.xls
xdqp-server requests detail-20240718-180259.xls
cpu-detail-20240718-175455.xls
databases-detail-20240718-180643.xls
file-i_o detail-20240718-175825.xls

xinjianguo commented 1 month ago

Status code counts, 2024-07-17 19:50:00-20:20:00 UTC (15:50:00-16:20:00 EDT)

CloudFront (non-frontend routes): run from Athena in the AWS console

select sc_status,count(*) as count
from lux_cloudfront_tst 
where date=date('2024-07-17') and time between '19:50:00' and '20:20:00' 
group by sc_status 
order by sc_status;
(screenshot: query results)

WebCache ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_webcache_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

Middle tier ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_middle_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

MarkLogic ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_marklogic_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

brent-hartwig commented 1 month ago

Thanks for the request counts and queries, @xinjianguo!

brent-hartwig commented 1 month ago

Based on the following chart from QA's report, the ramp-up schedule was five VUs every minute for 24 minutes until reaching a peak of 120 VUs, which held for just over a minute before a steep drop-off (presumed NeoLoad crash). The five VUs per minute comprised [TODO: breakdown by flow / transaction].

We switched to a more aggressive ramp-up schedule because NeoLoad is having trouble dealing with LUX errors that we believe LoadRunner handled better; QA is in contact with the vendor on the matter.

(chart: ramp-up schedule from QA's report)

brent-hartwig commented 1 month ago

Trimmed backend logs: 20240717-blue-as-test-backend-logs-trimmed.zip

brent-hartwig commented 1 month ago

Backend log mining output: 20240717-1950-2020-mined-log-output.zip

brent-hartwig commented 1 month ago

During #181's performance test, node 217 had higher CPU utilization than the other two nodes (#181's CPU utilization comment). This time that was the case for node 22: it accounted for 17 of the 18 data points in which utilization exceeded 95%. As part of the upgrade, all of Blue's nodes received new EC2 instances.

(chart: per-node CPU utilization)

brent-hartwig commented 1 month ago

The ratio of requests by type appears to have changed between LoadRunner and NeoLoad. The team concluded that the NeoLoad version is still in flux and thus any comparisons could be questionable. Nonetheless, capturing it here since the work was done.

(table: request counts by type, LoadRunner vs. NeoLoad)

Tables supporting the above table:

(screenshots: three supporting tables)
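
For reference, a breakdown along these lines could also be approximated directly from the ALB access logs in Athena. This is only a sketch: it assumes the table exposes the standard ALB request_url column, and the '%search%' / '%facet%' path fragments are illustrative placeholders, not the actual LUX route names.

-- Hedged sketch: approximate request counts by type from the middle tier ALB logs.
-- The first matching CASE branch wins, so adjust the patterns to the real routes.
select case
         when request_url like '%search%' then 'search'
         when request_url like '%facet%' then 'facets'
         else 'other'
       end as request_type,
       count(*) as count
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by 1
order by 2 desc;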

brent-hartwig commented 1 month ago

With caveats, we experienced a decrease in 504s received by the MarkLogic load balancer.

During the 18 Jun performance test, there were 3x as many 504s at the ML load balancer as ML processed. During this test, it was only 0.22x. This reduced pressure on the data service retry mechanism. Rerouting advancedSearchConfig and dataConstants requests may have helped somewhat; however, due to differences between the LoadRunner and NeoLoad implementations of the test, the ratio of requests by type was significantly different, including thousands fewer advancedSearchConfig and dataConstants requests than anticipated. For more on the change in request type composition, see this comment.

(screenshot: 504 comparison)
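
As a hedged sketch against the same Athena table used in the queries above, the two counts behind that ratio could be pulled like this; the interpretation of elb_status_code vs. target_status_code should be confirmed against the actual 504 analysis.

-- Hedged sketch: 504s generated at the load balancer only vs. 504s MarkLogic itself returned.
select sum(case when cast(elb_status_code as varchar) = '504'
                 and cast(target_status_code as varchar) <> '504' then 1 else 0 end) as lb_only_504s,
       sum(case when cast(target_status_code as varchar) = '504' then 1 else 0 end) as ml_504s
from lux_alb_marklogic_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';

Rows where the load balancer returned a 504 but the target did not would represent timeouts the data service retry mechanism has to absorb.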

brent-hartwig commented 1 month ago

Other observations copied from Teams...

Per CPU utilization, we know the NeoLoad test pushed MarkLogic hard, and that neither the V8 engine crashed nor the MarkLogic process restarted; all good.

But I otherwise find it hard to compare yesterday's test to the previous tests.

A metric I like checking is the number of facet requests per search request. During the 18 Jun test, there were 11.86 facet requests per search request. During yesterday's test, there were only 6.10. That's somewhat concerning given that all or most search results tabs have more than six facets.
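
If a quick cross-check of that ratio were wanted outside the backend log mining, something along these lines could approximate it from the middle tier ALB logs in Athena. Again a sketch only: it assumes a request_url column, and the '%facet%' / '%search%' fragments are illustrative rather than the real LUX data service routes.

-- Hedged sketch: facet requests per search request over the test window.
select cast(sum(case when request_url like '%facet%' then 1 else 0 end) as double)
         / nullif(sum(case when request_url like '%search%' then 1 else 0 end), 0) as facets_per_search
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';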

I find it odd that there were over 18K fewer advanced search configuration and data constants requests (combined) than expected, but I doubt that had a material effect given they are the equivalent of document requests; all are very lightweight.

I am surprised that there were any failed advanced search configuration or data constants requests, given that the successful requests were served by the second app server and that app server never registered any queued requests.

brent-hartwig commented 1 month ago

Executive Summary