project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance Test - scheduled for 2024-07-17 #226

Closed jffcamp closed 1 month ago

jffcamp commented 1 month ago

This was the first performance test executed by NeoLoad.

Primary Objective

We will be following scenario AM, which is similar to scenario J but with the Advanced Search Configuration and Data Constants requests moved from Group-1 to Group-2; this was done to minimize errors during the performance test.

The purpose of this test is to validate that ML 11.3 performs sufficiently to be moved to production with the 7/29 Blue/Green switch.

Changes Being Tested

We are primarily testing an upgrade to ML 11.3.0 GA, with the 2024-05-29 dataset.

Other changes compared to both previous performance tests:

Context

Environment and Versions

Backend Application Server Configuration

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

Collect Data

Restore and Verify Environment

Analyze

brent-hartwig commented 1 month ago

ML Monitoring History

Time period: 19:50 - 20:20 UTC (last test of the day with aggressive ramp up)

CPU:

01-cpu

File IO Detail:

02-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

intra-1-of-2

Intra-cluster activity, 2 of 2:

intra-2-of-2

Data node characteristics for the lux-content database alone:

07-database

Exports:

memory-detail-20240718-175928.xls
network-detail-20240718-180345.xls
servers-detail-20240718-180623.xls
xdqp-server requests detail-20240718-180259.xls
cpu-detail-20240718-175455.xls
databases-detail-20240718-180643.xls
file-i_o detail-20240718-175825.xls

xinjianguo commented 1 month ago

Status code counts, 2024-07-17 19:50:00-20:20:00 UTC (15:50:00-16:20:00 EDT)

CloudFront (non-frontend routes): run from Athena in the AWS console

select sc_status,count(*) as count
from lux_cloudfront_tst 
where date=date('2024-07-17') and time between '19:50:00' and '20:20:00' 
group by sc_status 
order by sc_status;
(screenshot: query results)

WebCache ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_webcache_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

Middle tier ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_middle_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

MarkLogic ALB:

select elb_status_code,target_status_code,count(*) as count
from lux_alb_marklogic_blue 
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z' 
group by elb_status_code,target_status_code 
order by elb_status_code,target_status_code;
(screenshot: query results)

brent-hartwig commented 1 month ago

Thanks for the request counts and queries, @xinjianguo!

brent-hartwig commented 1 month ago

Based on the following chart from QA's report, the ramp-up schedule was five VUs every minute for 24 minutes until reaching a peak of 120 VUs, which held for just over a minute before a steep drop-off (presumed NeoLoad crash). The five VUs per minute comprised [TODO: breakdown by flow / transaction].

We switched to a more aggressive ramp-up schedule because NeoLoad is having trouble dealing with LUX errors that we believe LoadRunner handled better; QA is in contact with the vendor on the matter.

(chart: ramp-up schedule from QA's report)

brent-hartwig commented 1 month ago

Trimmed backend logs: 20240717-blue-as-test-backend-logs-trimmed.zip

brent-hartwig commented 1 month ago

Backend log mining output: 20240717-1950-2020-mined-log-output.zip

brent-hartwig commented 1 month ago

During #181's performance test, node 217 had higher CPU utilization than the other two nodes (#181's CPU utilization comment). This time that was the case for node 22: it accounted for 17 of the 18 data points in which utilization exceeded 95%. As part of the upgrade, all of Blue's nodes received new EC2 instances.

(chart: per-node CPU utilization)

brent-hartwig commented 1 month ago

The ratio of requests by type appears to have changed between LoadRunner and NeoLoad. The team concluded that the NeoLoad version is still in flux and thus any comparisons could be questionable. Nonetheless, capturing it here since the work was done.

(table: request counts by type, LoadRunner vs. NeoLoad)

Tables supporting the above table:

(screenshots: three supporting tables)
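
For reference, a breakdown along these lines could also be approximated directly from the ALB access logs in Athena. This is only a sketch: it assumes the table exposes the standard ALB request_url column, and the '%search%' / '%facet%' path fragments are illustrative placeholders, not the actual LUX route names.

-- Hedged sketch: approximate request counts by type from the middle tier ALB logs.
-- The first matching CASE branch wins, so adjust the patterns to the real routes.
select case
         when request_url like '%search%' then 'search'
         when request_url like '%facet%' then 'facets'
         else 'other'
       end as request_type,
       count(*) as count
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by 1
order by 2 desc;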

brent-hartwig commented 1 month ago

With caveats, we experienced a decrease in 504s received by the MarkLogic load balancer.

During the 18 Jun performance test, there were 3x as many 504s at the ML load balancer as ML processed. During this test, it was only 0.22x. This reduced pressure on the data service retry mechanism. Rerouting advancedSearchConfig and dataConstants requests may have helped somewhat; however, due to differences between the LoadRunner and NeoLoad implementations of the test, the ratio of requests by type was significantly different, including thousands fewer advancedSearchConfig and dataConstants requests than anticipated. For more on the change in request type composition, see this comment.

(screenshot: 504 comparison)
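
As a hedged sketch against the same Athena table used in the queries above, the two counts behind that ratio could be pulled like this; the interpretation of elb_status_code vs. target_status_code should be confirmed against the actual 504 analysis.

-- Hedged sketch: 504s generated at the load balancer only vs. 504s MarkLogic itself returned.
select sum(case when cast(elb_status_code as varchar) = '504'
                 and cast(target_status_code as varchar) <> '504' then 1 else 0 end) as lb_only_504s,
       sum(case when cast(target_status_code as varchar) = '504' then 1 else 0 end) as ml_504s
from lux_alb_marklogic_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';

Rows where the load balancer returned a 504 but the target did not would represent timeouts the data service retry mechanism has to absorb.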

brent-hartwig commented 1 month ago

Other observations copied from Teams...

Per CPU utilization, we know the NeoLoad test pushed MarkLogic hard, and that neither the V8 engine crashed nor the MarkLogic process restarted; all good.

But I otherwise find it hard to compare yesterday's test to the previous tests.

A metric I like checking is the number of facet requests per search request. During the 18 Jun test, there were 11.86 facet requests per search request. During yesterday's test, there were only 6.10. That's somewhat concerning given that all or most search results tabs have more than six facets.
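
If a quick cross-check of that ratio were wanted outside the backend log mining, something along these lines could approximate it from the middle tier ALB logs in Athena. Again a sketch only: it assumes a request_url column, and the '%facet%' / '%search%' fragments are illustrative rather than the real LUX data service routes.

-- Hedged sketch: facet requests per search request over the test window.
select cast(sum(case when request_url like '%facet%' then 1 else 0 end) as double)
         / nullif(sum(case when request_url like '%search%' then 1 else 0 end), 0) as facets_per_search
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z';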

I find it odd that there were over 18K fewer advanced search configuration and data constants requests (combined) than expected, but I doubt that had a material effect given they are the equivalent of document requests; all are very lightweight.

I am surprised that there were any failed advanced search configuration or data constants requests, given that the successful requests were served by the second app server and that app server never registered any queued requests.

brent-hartwig commented 1 month ago

Executive Summary