project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Performance Test - scheduled for 2024-06-18 #181

Open brent-hartwig opened 2 weeks ago

brent-hartwig commented 2 weeks ago

Primary Objective

Scenario AK in the Perf Test Line Up: With Scenario AK (#162) as this test's baseline, quantify the performance impact of including alternative names in keyword and name search criteria.

Changes Being Tested

#100 on top of paginated facet requests (#160 and #161), which are part of the baseline test (#162). #100's changes:

  1. The related documents portion of the keyword search pattern switched from the referencePrimaryName field to the referenceName field. By search scope, here are the record types that are included in or excluded from the reference fields:
    • Included:
      • Agent: Group, Person
      • Concept: Language, MeasurementUnit, Type
      • Event: Activity, Period
      • Place: Place
    • Excluded:
      • Concept: Currency, Material
      • Item: DigitalObject, HumanMadeObject
      • Work: LinguisticObject, Set, VisualItem
  2. All name search terms except the one in the set search scope switched from their search scope-specific primary name field to their broader name field; for instance, the name search term in the agent scope changed from agentPrimaryName to agentName (see the sketch after this list).
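
To make the field swap concrete, here is a minimal, hypothetical sketch in MarkLogic server-side JavaScript. It is not the actual LUX query-generation code; only the field names referencePrimaryName/referenceName and agentPrimaryName/agentName come from the descriptions above.

```javascript
// Hypothetical illustration of the #100 field swaps; not the actual LUX code.

// Change 1: the related-documents portion of keyword search targets the
// broader reference field, which also indexes alternative names.
const FULL_TEXT_SEARCH_RELATED_FIELD_NAME = 'referenceName'; // was 'referencePrimaryName'

// Change 2: scope-specific name search terms target the broader name field,
// e.g., in the agent scope.
const AGENT_NAME_FIELD = 'agentName'; // was 'agentPrimaryName'

// A field word query against the broader field matches primary and
// alternative names alike.
function agentNameQuery(text) {
  return cts.fieldWordQuery(AGENT_NAME_FIELD, text, ['case-insensitive']);
}
```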

Context

Environment and Versions

Backend Application Server Configuration

Tasks

For more information, please see the documentation: LUX Performance Testing Procedure

Prep, Start, and Preliminary Checks

Collect data

Restore and Verify Environment

Analyze

Notes

brent-hartwig commented 1 week ago

Verified in Blue:

  1. FULL_TEXT_SEARCH_RELATED_FIELD_NAME = 'referenceName'
  2. 7 of 8 name search terms switched to include alternative names.
  3. Pagination support for facets.

cc: @clarkepeterf

clarkepeterf commented 1 week ago

Middle tier stats: 20240618-tst-middle-tier-stats.txt

xinjianguo commented 1 week ago

ML load balancer: perf-test-20240618-ML-ALB

CPU utilization of ML nodes: perf-test-20240618-ML-ec2-CPU

504s from ML load balancer: perf-test-20240618-ML-ALB-504

504s from web cache: perf-test-20240618-webcache-ALB-504

xinjianguo commented 1 week ago

OS-level metrics:

On all 3 ML nodes:

cd /tmp
ls -l sar*06-18*
sudo gzip sar*06-18*

On local:

cd ~/Apps/LUX/marklogic/scripts/logAnalysis
mkdir ~/Apps/LUX/ML/test/20240618
vi collectBackendLogs.sh
./collectBackendLogs.sh

sar_ip-10-5-156-239.its.yale.edu_2024-06-18T165429_screen.out.gz
sar_ip-10-5-156-239.its.yale.edu_2024-06-18T165429.out.gz
sar_ip-10-5-157-48.its.yale.edu_2024-06-18T165441_screen.out.gz
sar_ip-10-5-157-48.its.yale.edu_2024-06-18T165441.out.gz
sar_ip-10-5-254-35.its.yale.edu_2024-06-18T165459_screen.out.gz
sar_ip-10-5-254-35.its.yale.edu_2024-06-18T165459.out.gz

gigamorph commented 1 week ago

Status code counts

CloudFront (non-frontend routes): perf-cf-api-data

WebCache ALB: perf-webcache-alb

Middle tier ALB: perf-middle-alb

MarkLogic ALB: perf-marklogic-alb

brent-hartwig commented 1 week ago

ML monitoring exports:

memory-detail-20240618-191008.xls
network-detail-20240618-191309.xls
overview-20240618-190841.xls
servers-detail-20240618-191130.xls
xdqp-server requests detail-20240618-191041.xls
cpu-detail-20240618-190721.xls
databases-detail-20240618-201419.xls
file-i_o detail-20240618-190924.xls

brent-hartwig commented 1 week ago

ML Monitoring History

Time period: 1649 - 1900 UTC

This other comment speaks to the back-to-back drops in system demand that started about 25 minutes into the test and are reflected in several of the following charts.

CPU:

01-cpu

File IO Detail:

02-file-io

Memory:

03-memory

Intra-cluster activity, 1 of 2:

04-intra-cluster-1-of-2

Intra-cluster activity, 2 of 2:

05-intra-cluster-2-of-2

Both HTTP app servers:

06-http-servers

Just the lux-request-group-1 app server (all requests here except search and relatedList requests, up to 6 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Just the lux-request-group-2 app server (just search and relatedList requests, up to 12 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Data node characteristics for the lux-content database alone:

image

brent-hartwig commented 1 week ago

Trimmed backend logs:

20240618-blue-as-test-node-217-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-217-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-217-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-217-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-217-8003-AccessLog-trimmed.zip
20240618-blue-as-test-node-111-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-111-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-111-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-111-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-111-8003-AccessLog-trimmed.zip
20240618-blue-as-test-node-22-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8004-RequestLog-trimmed.zip
20240618-blue-as-test-node-22-8004-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8004-AccessLog-trimmed.zip
20240618-blue-as-test-node-22-8003-RequestLog-trimmed.zip
20240618-blue-as-test-node-22-8003-ErrorLog-trimmed.zip
20240618-blue-as-test-node-22-8003-AccessLog-trimmed.zip

brent-hartwig commented 1 week ago

About 25 minutes into the test, utilization rates dropped in the backend. Up to that point, there had been a single JS-FATAL instance, which we first observed in #162, but the database remained available and no ML process restarted. Seong was unable to find evidence of a crash higher in the stack. Yisha looked at the LoadRunner side but also could not come up with an explanation. This was the only such occurrence (a pair of back-to-back drops) during this test.

image

brent-hartwig commented 1 week ago

@prowns, @jffcamp, and @clarkepeterf, below is a comparison of backend request counts by response duration and request type between the baseline test (#162) and this one, represented as percentages. In addition to average VU flow response times and error rates, this may serve as an important means to assess the performance impact of this change.

Only successful requests are represented. Compared to the previous test, there were 8% fewer backend requests. Given the test also ran 3.7% shorter, I don't think the lower backend request count represents a material difference in system demand.

The larger the percentage in the first three to four progressively longer / overlapping timespans, the better; the larger the percentage in the last two timespans, the worse. I haven't looked too closely but have yet to see anything alarming.

(Apologies for the formatting. Markdown does not appear to support merged cells, at least how I'm creating this from Excel.)

| Endpoint | < 10 millis (07 Jun) | < 10 millis (18 Jun) | < 100 millis (07 Jun) | < 100 millis (18 Jun) | < 1 second (07 Jun) | < 1 second (18 Jun) | < 3 seconds (07 Jun) | < 3 seconds (18 Jun) | 3-10 seconds (07 Jun) | 3-10 seconds (18 Jun) | > 10 seconds (07 Jun) | > 10 seconds (18 Jun) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Search | 2.25% | 2.78% | 32.81% | 30.79% | 80.93% | 78.97% | 100.00% | 99.96% | 0.00% | 0.04% | 0.00% | 0.00% |
| Related Lists | 0.00% | 0.00% | 0.00% | 0.00% | 47.31% | 39.98% | 92.65% | 87.68% | 5.49% | 10.02% | 1.87% | 2.29% |
| Facets | 5.56% | 4.55% | 26.83% | 25.61% | 88.57% | 86.65% | 95.54% | 94.54% | 2.79% | 3.18% | 1.66% | 2.28% |
| Search Will Match | 0.71% | 0.64% | 15.25% | 15.79% | 60.79% | 60.89% | 82.57% | 86.66% | 2.18% | 1.89% | 15.25% | 11.46% |
| Search Estimate | 5.30% | 4.85% | 35.37% | 31.98% | 98.84% | 98.62% | 100.00% | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| Document | 97.89% | 97.67% | 99.98% | 99.97% | 100.00% | 100.00% | 100.00% | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% |

brent-hartwig commented 1 week ago

Even distribution among the three ML nodes:

| Node | Log | Count | % |
|---|---|---|---|
| 217 | request | 94,562 | 33.33% |
| 111 | request | 94,950 | 33.47% |
| 22 | request | 94,210 | 33.21% |
| Total | request | 283,722 | 100.00% |
| 217 | access | 92,726 | 33.33% |
| 111 | access | 93,077 | 33.45% |
| 22 | access | 92,416 | 33.22% |
| Total | access | 278,219 | 100.00% |

Regarding the difference in counts between the request logs and access logs, one theory is retries at the MarkLogic group level, whereby a request could be retried and represented each time in the request log but only once in the access log. One reason for retries would be deadlocks; however, `grep -i dead *.txt | grep -v "Applied the" | grep -v "Providing search"` didn't reveal any. Perhaps instead it has to do with the app server queue filling up on port 8003. Between ports 8003 and 8004, only 8003's request logs have more entries than the access logs. Perhaps we will see these counts get closer after implementing https://github.com/project-lux/lux-middletier/issues/67 and/or #180. Here are the differences between the (8003) request logs and access logs from this test:

| Node | Diff | % |
|---|---|---|
| 217 | 1,836 | 33.36% |
| 111 | 1,873 | 34.04% |
| 22 | 1,794 | 32.60% |
| Total | 5,503 | 100.00% |

brent-hartwig commented 1 week ago

We may be able to optimize searchWillMatch via changes to relatedListsLib.mjs's PRIORITY_BY_RELATION_KEY. That variable may be used to influence the order in which a related list's triple searches are executed when we only need to know whether the related list has at least one item. This capability is utilized in the frontend entity page context to determine whether to display a UI widget. Below is a formatted version of this test's atLeastOneRelatedListItem.txt (log mining output). Related lists without items are filtered out. The remaining rows are sorted by duration. While the duration of the same request can vary based on concurrent demand, note the number of relations checked, in column B. In this context, the fewer the better.

Ultimately, this is about how fast an agent, concept, or place entity page begins to render for a user. Each associated searchWillMatch request includes criteria for four related lists, asking whether they have at least one item.

image
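
For context, here is a hypothetical sketch of how PRIORITY_BY_RELATION_KEY could drive that short-circuiting. The relation keys and the runEstimate helper are made up, and relatedListsLib.mjs's actual implementation differs.

```javascript
// Hypothetical sketch only; the relation keys and runEstimate are illustrative.
const PRIORITY_BY_RELATION_KEY = {
  relationMostLikelyToMatch: 1,
  relationSomewhatLikelyToMatch: 2,
  relationRarelyMatching: 3,
};

// When the caller only needs to know whether a related list has at least one
// item, check the most promising relations first and stop at the first hit,
// minimizing the number of triple searches executed (column B above).
function hasAtLeastOneItem(relationKeys, runEstimate) {
  const ordered = [...relationKeys].sort(
    (a, b) =>
      (PRIORITY_BY_RELATION_KEY[a] || Number.MAX_SAFE_INTEGER) -
      (PRIORITY_BY_RELATION_KEY[b] || Number.MAX_SAFE_INTEGER)
  );
  return ordered.some((relationKey) => runEstimate(relationKey) > 0);
}
```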

brent-hartwig commented 1 week ago

Backend log findings, and related / follow-up tickets:

  1. The only entries above info-level in the main error logs contain "Slow send" and "Clearing expired". No concerns.
  2. Unlike the previous perf test, which had two XDMP-XDQPINVREQ instances, no request attempted to transfer too much data between nodes --counterintuitive for this test. Related: monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/25.
  3. There were no failed facet requests. Related: monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/21.
  4. No facet requests exceeded the product threshold. Related: #179 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/20.
  5. There were no failed related list requests.
  6. There were no failed search requests.
  7. The related list relations threshold was hit multiple times, as it would have been during previous tests as well. Related: #30 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/23.
  8. There were 1,133 searchEstimate-related XDMP-JSONDOC errors. Related: #183 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/24.
  9. There were four instances of "Error running JavaScript request: null". Unlike #162's instances, these were all associated with searchWillMatch, which #177 is to help with in the future.
  10. There were six instances of JS-FATAL / JS-BAD error pairs. As with #162's instances, these were all associated with searchWillMatch. Here too, #177 is to help us understand more about these in the future.
  11. There were no other noteworthy findings in the application logs.
brent-hartwig commented 1 week ago

Mined output from backend logs:

2024-06-18-perf-test-request-metrics.xlsx
2024-06-18-perf-test-all-requests-with-durations.xlsx
2024-06-18-perf-test-at-least-one-related-list-item.xlsx
2024-06-18-perf-test-search-params-and-durations.xlsx
allFacetRequests.txt
allFailedFacetRequests.txt
errorLogEntriesAboveInfo.txt
exceededFacetProductThreshold.txt
failedRelatedListRequests.txt
failedSearchOnlyRequests.txt
hitMaxNumberOfRelations.txt
oddRelatedListErrors.txt
searchLongRunning.txt
searchParamsAndDurations.json

brent-hartwig commented 1 week ago

As with #162 (in this comment), AWS reported the ML LB returned 1,000s of 504s per minute for the duration of the test, while the webcache LB peaked at 265 per minute, excluding peaks associated with an unexplained drop in CPU utilization within the ML cluster.

Also same as #162, 8003's request queue was full for most of the test.

:small_blue_diamond: New! When comparing how many requests the ML LB received per minute to the average number of requests the backend processed per minute, we see there's about a 3:1 ratio. The backend processed a total of 278,219 requests over 130 minutes, yielding an average of 2,140 requests per minute. This chart shows the ML LB processing about 6K per minute:

image
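
A quick back-of-the-envelope check of that ratio, with the per-minute ML LB figure read off the chart (so it's approximate):

```javascript
// Rough check of the ~3:1 ratio; 6,000/minute is eyeballed from the ALB chart.
const backendRequests = 278219;                         // total backend requests
const testMinutes = 130;                                // test duration
const backendPerMinute = backendRequests / testMinutes; // ~2,140
const mlLbPerMinute = 6000;
console.log((mlLbPerMinute / backendPerMinute).toFixed(1)); // ~2.8, roughly 3:1
```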

Ideas we're tracking to counter this:

brent-hartwig commented 1 week ago

Since #162, additional logging was enabled in an attempt to better understand the lower than expected backend request counts. The additional logging gives us response counts by status code throughout the stack. Below is a table of this information, along with some questions.

Credit to Seong for co-investigating, specifically digging deeper into CloudFront and the web cache. His findings help to better explain the counts.

image


Q1. Why does CloudFront have 21K fewer requests than the Web Cache LB?

The Web Cache LB reported 16K more 5XXs than CloudFront, and CloudFront has the ability to retry failed GET requests up to three times.

The other 5K are unknown.

Comparing at the status code level gets more confusing; e.g., why does CloudFront return fewer 404s?


Q2. Why does the Web Cache LB have more responses than the middle tier LB despite the cache being disabled?

Perhaps the 'disabled' cache is still caching and/or consolidating some requests. Interested in completely bypassing the cache in a future test.


Q3. Why does the Web Cache LB have fewer 200s than the middle tier LB?

16K of the 27K may be explained by the web cache converting a 200 to a 304 due to ETag comparisons (the middle tier could be modified to return 304s when appropriate).
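
For reference, a minimal sketch of how the middle tier could return 304s itself, assuming an Express-style Node.js handler; the helper below is hypothetical and lux-middletier's actual code differs.

```javascript
// Hypothetical Express-style helper; not lux-middletier's actual code.
const crypto = require('crypto');

function sendWithEtag(req, res, body) {
  // Derive a strong ETag from the response body.
  const etag = '"' + crypto.createHash('sha1').update(body).digest('hex') + '"';
  res.set('ETag', etag);
  // If the client already has this representation, skip the body.
  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();
  }
  return res.status(200).send(body);
}
```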


Q4. Why is there only a 1:1.09 ratio between the middle tier (254,793) and backend (278,339) request counts?

With one exception, middle tier requests that make backend requests only make one backend request. The exception is when a document is requested but a profile is not specified; in that case, there are two backend requests: the document and a searchWillMatch request.

Note too that there are middle tier requests that go no further, including 3,150 /api/_info requests, 909 health checks, and ~100 cyber attack attempts.
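
A hypothetical sketch of the document-route behavior described above; callBackend and the searchWillMatch path are assumptions, and the real lux-middletier handler differs.

```javascript
// Hypothetical sketch; callBackend and the paths below are illustrative only.
async function handleDocumentRequest(type, uuid, profile) {
  if (profile) {
    // One backend request: the profiled document.
    return callBackend(`/data/${type}/${uuid}?profile=${profile}`);
  }
  // No profile: two backend requests -- the document itself plus a
  // searchWillMatch request used to decide which related-list widgets the
  // entity page should display.
  const [doc, willMatch] = await Promise.all([
    callBackend(`/data/${type}/${uuid}`),
    callBackend(`/searchWillMatch?uri=/data/${type}/${uuid}`), // path assumed
  ]);
  return { doc, willMatch };
}
```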


Q5. What happens when searchWillMatch fails in the context of /data/:type/:uuid?

TBD


Q6. Did the Data Service proxies retry 494K requests?

We have no other explanation for the difference between the 255K middle tier LB target responses (Node.js) and the 750K ML LB requests.


Q7. How many requests were the Data Service proxies unable to successfully retry?

TBD


#162's comments on the matter: first and second


Seong's take on status code 0 and status code 460, which I agree with: "It seems, status 0 from CloudFront and 460 from web cache and middle tier, they mean the same thing: client closed connection before server could respond. They are not standard HTTP status codes -- they cannot be returned to the client after all."

brent-hartwig commented 1 week ago

CPU Utilization Node Variance

CPU utilization was above 90% in 27 of 399 data points. While that averages to 9 of 133 data points per node, it turns out:

As detailed in this comment, node 217 processed 33.33% of the requests.

Data informing the above: cpu-detail-20240618-190721.xlsx, which is a formatted version of the previously shared export.

Takeaway: node 217 is taxed more than the other two nodes --possibly enough to investigate and/or monitor. A possible explanation is a sub-par EC2 instance (or a dependency thereof).

brent-hartwig commented 1 week ago

After adding data from QA's report, we were able to complete this test's column in the key metrics spreadsheet.

Here is a comparison of this test to its baseline:

image

Highlights of the comparison:

  1. Both tests had an overall error rate of 4.41% --a positive indicator.
  2. The eight VU flows (a.k.a. transactions) with individual error rates of 10% or greater were the same in both tests.
    • Four had greater error rates in the latest test, with increases ranging from 0.59% to 9.47%. The flow with the largest increase was "Lux_Events_tab".
    • The other four had smaller error rates in the latest test, with decreases ranging from 1.73% to 17.87%. The flow with the largest decrease was "Lux_Objects_Faucets".
    • It is unknown to what degree these flow-level error rate changes are due to the change being tested.
  3. This test was faster in three of the four individual VU flows we track, including simple search --the opposite of what one may have expected.
  4. This test had a faster average, unweighted VU response time --another surprise.
  5. This test took CPU utilization from 90.32% to 94.47%. Note that one node was taxed more than the other two, which is detailed in the CPU Utilization Node Variance comment.
  6. The backend request count was in line with the baseline test's. Separately, we are still looking into the lower than expected backend request counts; see the stack response count analysis comment for an analysis of the load balancers.
  7. In both tests, 8003's request queue reported full for most of the test and the ML load balancer returned MANY 504s. Here too, we continue to look into it; for more, see the 504s comment.