project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance Test - scheduled for 2024-06-18 #181

Open brent-hartwig opened 2 weeks ago

brent-hartwig commented 2 weeks ago

Primary Objective

Scenario AK in the Perf Test Line Up: With Scenario AK (#162) as this test's baseline, quantify the performance impact of including alternative names in keyword and name search criteria.

Changes Being Tested

#100 on top of paginated facet requests (#160 and #161), which are part of the baseline test (#162). #100's changes:

  1. The related documents portion of the keyword search pattern switched from the referencePrimaryName field to the referenceName field. By search scope, here are the record types that are included in or excluded from the reference fields:
    • Included:
      • Agent: Group, Person
      • Concept: Language, MeasurementUnit, Type
      • Event: Activity, Period
      • Place: Place
    • Excluded:
      • Concept: Currency, Material
      • Item: DigitalObject, HumanMadeObject
      • Work: LinguisticObject, Set, VisualItem
  2. All name search terms except the one in the set search scope switched from their search scope-specific primary name field to their broader name field; for instance, the name search term in the agent scope changed from agentPrimaryName to agentName (see the sketch after this list).
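
To make the field swap concrete, here is a minimal, hypothetical sketch in MarkLogic server-side JavaScript. It is not the actual LUX query-generation code; only the field names referencePrimaryName/referenceName and agentPrimaryName/agentName come from the descriptions above.

```javascript
// Hypothetical illustration of the #100 field swaps; not the actual LUX code.

// Change 1: the related-documents portion of keyword search targets the
// broader reference field, which also indexes alternative names.
const FULL_TEXT_SEARCH_RELATED_FIELD_NAME = 'referenceName'; // was 'referencePrimaryName'

// Change 2: scope-specific name search terms target the broader name field,
// e.g., in the agent scope.
const AGENT_NAME_FIELD = 'agentName'; // was 'agentPrimaryName'

// A field word query against the broader field matches primary and
// alternative names alike.
function agentNameQuery(text) {
  return cts.fieldWordQuery(AGENT_NAME_FIELD, text, ['case-insensitive']);
}
```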

Context

Environment and Versions

Backend Application Server Configuration

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

Collect data

Restore and Verify Environment

Analyze

Notes

brent-hartwig commented 1 week ago

Verified in Blue:

  1. FULL_TEXT_SEARCH_RELATED_FIELD_NAME = 'referenceName'
  2. 7 of 8 name search terms switched to include alternative names.
  3. Pagination support for facets.

cc: @clarkepeterf

clarkepeterf commented 1 week ago

Middle tier stats: 20240618-tst-middle-tier-stats.txt

xinjianguo commented 1 week ago

ML load balancer: perf-test-20240618-ML-ALB

CPU utilization of ML nodes: perf-test-20240618-ML-ec2-CPU

504s from ML load balancer: perf-test-20240618-ML-ALB-504

504s from web cache: perf-test-20240618-webcache-ALB-504

xinjianguo commented 1 week ago

OS-level metrics:

On all 3 ML nodes:

cd /tmp
ls -l sar*06-18*
sudo gzip sar*06-18*

On local:

cd ~/Apps/LUX/marklogic/scripts/logAnalysis
mkdir ~/Apps/LUX/ML/test/20240618
vi collectBackendLogs.sh
./collectBackendLogs.sh

sar_ip-10-5-156-239.its.yale.edu_2024-06-18T165429_screen.out.gz
sar_ip-10-5-156-239.its.yale.edu_2024-06-18T165429.out.gz
sar_ip-10-5-157-48.its.yale.edu_2024-06-18T165441_screen.out.gz
sar_ip-10-5-157-48.its.yale.edu_2024-06-18T165441.out.gz
sar_ip-10-5-254-35.its.yale.edu_2024-06-18T165459_screen.out.gz
sar_ip-10-5-254-35.its.yale.edu_2024-06-18T165459.out.gz

gigamorph commented 1 week ago

Status code counts

CloudFront (non-frontend routes): perf-cf-api-data

WebCache ALB: perf-webcache-alb

Middle tier ALB: perf-middle-alb

MarkLogic ALB: perf-marklogic-alb

brent-hartwig commented 1 week ago

ML monitoring exports:

memory-detail-20240618-191008.xls
network-detail-20240618-191309.xls
overview-20240618-190841.xls
servers-detail-20240618-191130.xls
xdqp-server requests detail-20240618-191041.xls
cpu-detail-20240618-190721.xls
databases-detail-20240618-201419.xls
file-i_o detail-20240618-190924.xls

brent-hartwig commented 1 week ago

ML Monitoring History

Time period: 1649 - 1900 UTC

This other comment speaks to the back-to-back drops in system demand that started about 25 minutes into the test and are reflected in several of the following charts.

CPU:

01-cpu

File IO Detail:

02-file-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

04-intra-cluster-1-of-2

Intra-cluster activity, 2 of 2:

05-intra-cluster-2-of-2

Both HTTP app servers:

06-http-servers

Just the lux-request-group-1 app server (all requests here except search and relatedList requests, up to 6 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Just the lux-request-group-2 app server (just search and relatedList requests, up to 12 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Data node characteristics for the lux-content database alone:

image

brent-hartwig commented 1 week ago

Trimmed backend logs:

20240618-blue-as-test-node-217-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-217-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-217-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-217-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8003-AccessLog-trimmed.zip
20240618-blue-as-test-node-111-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-111-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-111-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-111-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8003-AccessLog-trimmed.zip
20240618-blue-as-test-node-22-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-22-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-22-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-22-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8003-AccessLog-trimmed.zip

brent-hartwig commented 1 week ago

About 25 minutes into the test, utilization rates dropped in the backend. Up to that point, there had been a single JS-FATAL instance, which we first observed in #162, but the database remained available and no ML process restarted. Seong was unable to find evidence of a crash higher in the stack. Yisha looked at the LoadRunner side but also could not come up with an explanation. This was the only such occurrence (a pair of back-to-back drops) during this test.

image

brent-hartwig commented 1 week ago

@prowns, @jffcamp, and @clarkepeterf, below is a comparison of backend request counts by response duration and request type between the baseline test (#162) and this one, represented as percentages. In addition to average VU flow response times and error rates, this may serve as an important means to assess the performance impact of this change.

Only successful requests are represented. Compared to the previous test, there were 8% fewer backend requests. Given the test also ran 3.7% shorter, I don't think the lower backend request count represents a material difference in system demand.

The larger the percentage in the first three to four progressively longer / overlapping timespans, the better; the larger the percentage in the last two timespans, the worse. I haven't looked too closely but have yet to see anything alarming.

(Apologies for the formatting. Markdown does not appear to support merged cells, at least how I'm creating this from Excel.)

| Endpoint | < 10 millis (07 Jun) | < 10 millis (18 Jun) | < 100 millis (07 Jun) | < 100 millis (18 Jun) | < 1 second (07 Jun) | < 1 second (18 Jun) | < 3 seconds (07 Jun) | < 3 seconds (18 Jun) | 3-10 seconds (07 Jun) | 3-10 seconds (18 Jun) | > 10 seconds (07 Jun) | > 10 seconds (18 Jun) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Search | 2.25% | 2.78% | 32.81% | 30.79% | 80.93% | 78.97% | 100.00% | 99.96% | 0.00% | 0.04% | 0.00% | 0.00% |
| Related Lists | 0.00% | 0.00% | 0.00% | 0.00% | 47.31% | 39.98% | 92.65% | 87.68% | 5.49% | 10.02% | 1.87% | 2.29% |
| Facets | 5.56% | 4.55% | 26.83% | 25.61% | 88.57% | 86.65% | 95.54% | 94.54% | 2.79% | 3.18% | 1.66% | 2.28% |
| Search Will Match | 0.71% | 0.64% | 15.25% | 15.79% | 60.79% | 60.89% | 82.57% | 86.66% | 2.18% | 1.89% | 15.25% | 11.46% |
| Search Estimate | 5.30% | 4.85% | 35.37% | 31.98% | 98.84% | 98.62% | 100.00% | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Document | 97.89% | 97.67% | 99.98% | 99.97% | 100.00% | 100.00% | 100.00% | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% |

brent-hartwig commented 1 week ago

Even distribution among the three ML nodes:

| Node | Log | Count | % |
|---|---|---|---|
| 217 | request | 94,562 | 33.33% |
| 111 | request | 94,950 | 33.47% |
| 22 | request | 94,210 | 33.21% |
| Total | request | 283,722 | 100.00% |
| 217 | access | 92,726 | 33.33% |
| 111 | access | 93,077 | 33.45% |
| 22 | access | 92,416 | 33.22% |
| Total | access | 278,219 | 100.00% |

Regarding the difference in counts between the request logs and access logs, one theory is retries at the MarkLogic group level, whereby a request could be retried and represented each time in the request log but only once in the access log. One reason for retries would be deadlocks; however, `grep -i dead *.txt | grep -v "Applied the" | grep -v "Providing search"` didn't reveal any. Perhaps instead it has to do with the app server queue filling up on port 8003. Between ports 8003 and 8004, only 8003's request logs have more entries than the access logs. Perhaps we will see these counts get closer after implementing https://github.com/project-lux/lux-middletier/issues/67 and/or #180. Here are the differences between the (8003) request logs and access logs from this test:

| Node | Diff | % |
|---|---|---|
| 217 | 1,836 | 33.36% |
| 111 | 1,873 | 34.04% |
| 22 | 1,794 | 32.60% |
| Total | 5,503 | 100.00% |

brent-hartwig commented 1 week ago

We may be able to optimize searchWillMatch via changes to relatedListsLib.mjs's PRIORITY_BY_RELATION_KEY. That variable may be used to influence the order in which a related list's triple searches are executed when we only need to know whether the related list has at least one item. This capability is utilized in the frontend entity page context to determine whether to display a UI widget. Below is a formatted version of this test's atLeastOneRelatedListItem.txt (log mining output). Related lists without items are filtered out. The remaining rows are sorted by duration. While the duration of the same request can vary based on concurrent demand, note the number of relations checked, in column B. In this context, the fewer the better.

Ultimately, this is about how fast an agent, concept, or place entity page begins to render for a user. Each associated searchWillMatch request includes criteria for four related lists, asking whether they have at least one item.

image
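
For context, here is a hypothetical sketch of how PRIORITY_BY_RELATION_KEY could drive that short-circuiting. The relation keys and the runEstimate helper are made up, and relatedListsLib.mjs's actual implementation differs.

```javascript
// Hypothetical sketch only; the relation keys and runEstimate are illustrative.
const PRIORITY_BY_RELATION_KEY = {
  relationMostLikelyToMatch: 1,
  relationSomewhatLikelyToMatch: 2,
  relationRarelyMatching: 3,
};

// When the caller only needs to know whether a related list has at least one
// item, check the most promising relations first and stop at the first hit,
// minimizing the number of triple searches executed (column B above).
function hasAtLeastOneItem(relationKeys, runEstimate) {
  const ordered = [...relationKeys].sort(
    (a, b) =>
      (PRIORITY_BY_RELATION_KEY[a] || Number.MAX_SAFE_INTEGER) -
      (PRIORITY_BY_RELATION_KEY[b] || Number.MAX_SAFE_INTEGER)
  );
  return ordered.some((relationKey) => runEstimate(relationKey) > 0);
}
```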

brent-hartwig commented 1 week ago

Backend log findings, and related / follow-up tickets:

  1. The only entries above info-level in the main error logs contain "Slow send" and "Clearing expired". No concerns.
  2. Unlike the previous perf test, which had two XDMP-XDQPINVREQ instances, no request attempted to transfer too much data between nodes --counterintuitive for this test. Related: monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/25.
  3. There were no failed facet requests. Related: monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/21.
  4. No facet requests exceeded the product threshold. Related: #179 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/20.
  5. There were no failed related list requests.
  6. There were no failed search requests.
  7. The related list relations threshold was hit multiple times, as it would have been during previous tests as well. Related: #30 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/23.
  8. There were 1,133 searchEstimate-related XDMP-JSONDOC errors. Related: #183 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/24.
  9. There were four instances of "Error running JavaScript request: null". Unlike #162's instances, these were all associated with searchWillMatch, which #177 is to help with in the future.
  10. There were six instances of JS-FATAL / JS-BAD error pairs. As with #162's instances, these were all associated with searchWillMatch. Here too, #177 is to help us understand more about these in the future.
  11. There were no other noteworthy findings in the application logs.
brent-hartwig commented 1 week ago

Mined output from backend logs:

2024-06-18-perf-test-request-metrics.xlsx
2024-06-18-perf-test-all-requests-with-durations.xlsx
2024-06-18-perf-test-at-least-one-related-list-item.xlsx
2024-06-18-perf-test-search-params-and-durations.xlsx
allFacetRequests.txt
allFailedFacetRequests.txt
errorLogEntriesAboveInfo.txt
exceededFacetProductThreshold.txt
failedRelatedListRequests.txt
failedSearchOnlyRequests.txt
hitMaxNumberOfRelations.txt
oddRelatedListErrors.txt
searchLongRunning.txt
searchParamsAndDurations.json

brent-hartwig commented 1 week ago

As with #162 (in this comment), AWS reported the ML LB returned 1,000s of 504s per minute for the duration of the test, while the webcache LB peaked at 265 per minute, excluding peaks associated with an unexplained drop in CPU utilization within the ML cluster.

Also same as #162, 8003's request queue was full for most of the test.

:small_blue_diamond: New! When comparing how many requests the ML LB received per minute to the average number of requests the backend processed per minute, we see there's about a 3:1 ratio. The backend processed a total of 278,219 requests over 130 minutes, yielding an average of 2,140 requests per minute. This chart shows the ML LB processing about 6K per minute:

image
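
A quick back-of-the-envelope check of that ratio, with the per-minute ML LB figure read off the chart (so it's approximate):

```javascript
// Rough check of the ~3:1 ratio; 6,000/minute is eyeballed from the ALB chart.
const backendRequests = 278219;                         // total backend requests
const testMinutes = 130;                                // test duration
const backendPerMinute = backendRequests / testMinutes; // ~2,140
const mlLbPerMinute = 6000;
console.log((mlLbPerMinute / backendPerMinute).toFixed(1)); // ~2.8, roughly 3:1
```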

Ideas we're tracking to counter this:

brent-hartwig commented 1 week ago

Since #162, additional logging was enabled in an attempt to better understand the lower than expected backend request counts. The additional logging gives us response counts by status code throughout the stack. Below is a table of this information, along with some questions.

Credit to Seong for co-investigating, specifically digging deeper into CloudFront and the web cache. His findings help to better explain the counts.

image


Q1. Why does CloudFront have 21K fewer requests than the Web Cache LB?

The Web Cache LB reported 16K more 5XXs than CloudFront, and CloudFront has the ability to retry failed GET requests up to three times.

The other 5K are unknown.

Comparing at the status code level gets more confusing; e.g., why does CloudFront return fewer 404s?


Q2. Why does the Web Cache LB have more responses than the middle tier LB despite the cache being disabled?

Perhaps the 'disabled' cache is still caching and/or consolidating some requests. Interested in completely bypassing the cache in a future test.


Q3. Why does the Web Cache LB have fewer 200s than the middle tier LB?

16K of the 27K may be explained by the web cache converting a 200 to a 304 due to ETag comparisons (the middle tier could be modified to return 304s when appropriate).
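
For reference, a minimal sketch of how the middle tier could return 304s itself, assuming an Express-style Node.js handler; the helper below is hypothetical and lux-middletier's actual code differs.

```javascript
// Hypothetical Express-style helper; not lux-middletier's actual code.
const crypto = require('crypto');

function sendWithEtag(req, res, body) {
  // Derive a strong ETag from the response body.
  const etag = '"' + crypto.createHash('sha1').update(body).digest('hex') + '"';
  res.set('ETag', etag);
  // If the client already has this representation, skip the body.
  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();
  }
  return res.status(200).send(body);
}
```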


Q4. Why is there only a 1:1.09 ratio between the middle tier (254,793) and backend (278,339) request counts?

With one exception, middle tier requests that make backend requests only make one backend request. The exception is when a document is requested but a profile is not specified; in that case, there are two backend requests: the document and a searchWillMatch request.

Note too that there are middle tier requests that go no further, including 3,150 /api/_info requests, 909 health checks, and ~100 cyber attack attempts.
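
A hypothetical sketch of the document-route behavior described above; callBackend and the searchWillMatch path are assumptions, and the real lux-middletier handler differs.

```javascript
// Hypothetical sketch; callBackend and the paths below are illustrative only.
async function handleDocumentRequest(type, uuid, profile) {
  if (profile) {
    // One backend request: the profiled document.
    return callBackend(`/data/${type}/${uuid}?profile=${profile}`);
  }
  // No profile: two backend requests -- the document itself plus a
  // searchWillMatch request used to decide which related-list widgets the
  // entity page should display.
  const [doc, willMatch] = await Promise.all([
    callBackend(`/data/${type}/${uuid}`),
    callBackend(`/searchWillMatch?uri=/data/${type}/${uuid}`), // path assumed
  ]);
  return { doc, willMatch };
}
```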


Q5. What happens when searchWillMatch fails in the context of /data/:type/:uuid?

TBD


Q6. Did the Data Service proxies retry 494K requests?

We have no other explanation for the difference between the 255K middle tier LB target responses (Node.js) and the 750K ML LB requests.


Q7. How many requests were the Data Service proxies unable to successfully retry?

TBD


#162's comments on the matter: first and second


Seong's take on status code 0 and status code 460, which I agree with: "It seems, status 0 from CloudFront and 460 from web cache and middle tier, they mean the same thing: client closed connection before server could respond. They are not standard HTTP status codes -- they cannot be returned to the client after all."

brent-hartwig commented 1 week ago

CPU Utilization Node Variance

CPU utilization was above 90% in 27 of 399 data points. While that averages to 9 of 133 data points per node, it turns out:

As detailed in this comment, node 217 processed 33.33% of the requests.

Data informing the above: cpu-detail-20240618-190721.xlsx, which is a formatted version of the previously shared export.

Takeaway: node 217 is taxed more than the other two nodes --possibly enough to investigate and/or monitor. A possible explanation is a sub-par EC2 instance (or a dependency thereof).

brent-hartwig commented 1 week ago

After adding data from QA's report, we were able to complete this test's column in the key metrics spreadsheet.

Here is a comparison of this test to its baseline:

image

Highlights of the comparison:

  1. Both tests had an overall error rate of 4.41% --a positive indicator.
  2. The eight VU flows (a.k.a. transactions) with individual error rates of 10% or greater were the same in both tests.
    • Four had greater error rates in the latest test, with increases ranging from 0.59% to 9.47%. The flow with the largest increase was "Lux_Events_tab".
    • The other four had smaller error rates in the latest test, with decreases ranging from 1.73% to 17.87%. The flow with the largest decrease was "Lux_Objects_Faucets".
    • It is unknown to what degree these flow-level error rate changes are due to the change being tested.
  3. This test was faster in three of the four individual VU flows we track, including simple search --the opposite of what one may have expected.
  4. This test had a faster average, unweighted VU response time --another surprise.
  5. This test took CPU utilization from 90.32% to 94.47%. Note that one node was taxed more than the other two, which is detailed in the CPU Utilization Node Variance comment.
  6. The backend request count was in line with the baseline test's. Separately, we are still looking into the lower than expected backend request counts; see the stack response count analysis comment for an analysis of the load balancers.
  7. In both tests, 8003's request queue reported full for most of the test and the ML load balancer returned MANY 504s. Here too, we continue to look into it; for more, see the 504s comment.