Closed: jffcamp closed this issue 1 month ago
Time period: 19:50 - 20:20 UTC (last test of the day with aggressive ramp up)
CPU:
File IO Detail:
Memory:
Intra-cluster activity, 1 of 2:
Intra-cluster activity, 2 of 2:
Data node characteristics for the lux-content database alone:
Exports:
- memory-detail-20240718-175928.xls
- network-detail-20240718-180345.xls
- servers-detail-20240718-180623.xls
- xdqp-server requests detail-20240718-180259.xls
- cpu-detail-20240718-175455.xls
- databases-detail-20240718-180643.xls
- file-i_o detail-20240718-175825.xls
Status code counts 2024-07-17 19:50:00 - 20:20:00 UTC or 15:50:00 - 16:20:00 EDT
CloudFront (non-frontend routes): run from AWS console Athena
select sc_status,count(*) as count
from lux_cloudfront_tst
where date=date('2024-07-17') and time between '19:50:00' and '20:20:00'
group by sc_status
order by sc_status;
WebCache ALB:
select elb_status_code,target_status_code,count(*) as count
from lux_alb_webcache_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by elb_status_code,target_status_code
order by elb_status_code,target_status_code;
Middle tier ALB:
select elb_status_code,target_status_code,count(*) as count
from lux_alb_middle_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by elb_status_code,target_status_code
order by elb_status_code,target_status_code;
MarkLogic ALB:
select elb_status_code,target_status_code,count(*) as count
from lux_alb_marklogic_blue
where time between '2024-07-17T19:50:00.000000Z' and '2024-07-17T20:20:00.000000Z'
group by elb_status_code,target_status_code
order by elb_status_code,target_status_code;
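The three ALB queries above all return `(elb_status_code, target_status_code, count)` rows. As a hedged sketch (the function name and sample rows are hypothetical, not from the actual Athena output), the rows can be pivoted into a per-ELB-status breakdown for easier side-by-side comparison across tiers:

```python
from collections import defaultdict

def pivot_status_counts(rows):
    """Pivot (elb_status_code, target_status_code, count) rows into a
    dict keyed by ELB status, with per-target-status breakdowns."""
    table = defaultdict(dict)
    for elb_status, target_status, count in rows:
        table[elb_status][target_status] = count
    return dict(table)

# Hypothetical sample shaped like the MarkLogic ALB query output.
# A None target status means the ELB gave up before the target answered.
sample = [
    (200, 200, 41250),
    (504, 504, 12),
    (504, None, 43),
]

pivoted = pivot_status_counts(sample)
print(pivoted[504])  # {504: 12, None: 43}
```

Distinguishing `elb_status_code` from `target_status_code` matters here because a 504 with no target status is an ELB-side timeout, not an error MarkLogic itself returned.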
Thanks for the request counts and queries, @xinjianguo!
Based on the following chart from QA's report, the ramp-up schedule added five VUs every minute for 24 minutes until reaching a peak of 120, which held for just over a minute before a steep drop-off (presumed NeoLoad crash). The five VUs per minute were composed of [TODO: breakdown by flow / transaction].
We switched to a more aggressive ramp-up schedule because NeoLoad is having trouble dealing with LUX errors that we believe LoadRunner handled better; QA is in contact with the vendor on the matter.
Trimmed backend logs: 20240717-blue-as-test-backend-logs-trimmed.zip
Backend log mining output: 20240717-1950-2020-mined-log-output.zip
During #181's performance test, node 217 had higher CPU utilization than the other two nodes (#181's CPU utilization comment). This time, node 22 was the outlier: it accounted for 17 of the 18 data points where utilization exceeded 95%. As part of the upgrade, all of Blue's nodes received new EC2 instances.
The ratio of requests by type appears to have changed between LoadRunner and NeoLoad. The team concluded the NeoLoad version is still in flux and thus any comparisons could be questionable. Nonetheless, capturing it here since the work was done.
Tables supporting the above table:
With caveats, we observed a decrease in 504s received by the MarkLogic load balancer.
During 18 Jun's performance test, there were 3x as many 504s at the ML load balancer as ML itself processed. During this test, the ratio was only 0.22x. This reduced pressure on the data service retry mechanism. Rerouting advancedSearchConfig and dataConstants requests may have helped somewhat; however, due to differences between the LoadRunner and NeoLoad implementations of the test, the ratio of requests by type was significantly different, including thousands fewer advancedSearchConfig and dataConstants requests than anticipated. For more on the request type composition change, see this comment.
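The 3x vs 0.22x comparison boils down to one ratio. A hedged sketch (the function name and the sample counts are hypothetical, chosen only to reproduce the cited ratios):

```python
def retry_pressure(lb_504s, ml_504s):
    """504s seen at the ML load balancer per 504 MarkLogic itself logged.
    Values above 1 mean the LB timed out on requests ML never finished,
    which is what drives the data service retry mechanism."""
    return round(lb_504s / ml_504s, 2)

# Hypothetical counts consistent with the ratios cited above.
print(retry_pressure(300, 100))  # 3.0  (18 Jun test)
print(retry_pressure(22, 100))   # 0.22 (this test)
```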
Other observations copied from Teams...
Per CPU utilization, we know the NeoLoad test pushed MarkLogic hard, and that neither the v8 engine crashed nor the MarkLogic process restarted -- all good.
But I otherwise find it hard to compare yesterday's test to the previous tests.
A metric I like checking is the number of facet requests per search request. During the 18 Jun test, there were 11.86 facet requests per search request. During yesterday's test, there were only 6.10. That's somewhat concerning given that all or most search results tabs have more than six facets.
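The facet-per-search metric is a simple quotient over the mined request-type counts. A minimal sketch, assuming the counts have been extracted from the backend logs (the count dictionaries below are hypothetical, scaled to reproduce the cited ratios):

```python
# Hypothetical request-type counts as mined from the backend logs.
counts_jun18 = {"search": 1000, "facets": 11860}
counts_jul17 = {"search": 1000, "facets": 6100}

def facets_per_search(counts):
    """Facet requests issued per search request."""
    return round(counts["facets"] / counts["search"], 2)

print(facets_per_search(counts_jun18))  # 11.86
print(facets_per_search(counts_jul17))  # 6.1
```

Since each search results tab triggers one facet request per visible facet, a ratio well below the typical facet count per tab suggests the NeoLoad script is not requesting all facets.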
I find it odd that there were over 18K fewer advanced search configuration and data constants requests (combined) than expected, but I doubt that had a material effect, given they are the equivalent of document requests and all very lightweight.
I am surprised that there were any failed advanced search configuration or data constants requests given the successful requests were served up by the second app server and that app server never registered any queued requests.
This was the first performance test executed by NeoLoad.
Primary Objective
We will be following scenario AM, which is similar to scenario J but with Advanced Search Configuration and Data Constants moved from Group-1 to Group-2. This was done to minimize errors during the performance test.
The purpose of this test is to validate that ML 11.3 performs sufficiently to be moved to production with the 7/29 Blue/Green switch.
Changes Being Tested
We are primarily testing an upgrade to ML 11.3.0 GA, with the 2024-05-29 dataset.
Other changes compared to both previous performance tests:
Context
Environment and Versions
_getSearchTermConfig
Backend Application Server Configuration
Tasks
For more information, please see the documentation: LUX Performance Testing Procedure
Prep, Start, and Preliminary Checks
v8 delay timeout
Collect data
Restore and Verify Environment
Analyze