Performance test: include alternative names when resolving name search criteria

jffcamp commented 1 month ago

Primary objective: Determine if switching from the smaller primary name fields to the larger fields containing both primary and alternative names still yields acceptable performance. See below for specific changes being tested. The associated development ticket is #100.

Code and Configuration Changes:

The related documents portion of the keyword search pattern switched from the referencePrimaryName field to the referenceName field. By search scope, here are the record types that are included in or excluded from the reference fields:
- Included:
  - Agent: Group, Person
  - Concept: Language, MeasurementUnit, Type
  - Event: Activity, Period
  - Place: Place
- Excluded:
  - Concept: Currency, Material
  - Item: DigitalObject, HumanMadeObject
  - Work: LinguisticObject, Set, VisualItem
All name search terms except those in the set search scope switched from their search scope-specific primary name field to their broader name field; for instance, the name search term in the agent scope changed from agentPrimaryName to agentName.
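
The field switch can be pictured as a small lookup from the primary-name field to its broader counterpart. This is an illustrative Python sketch only, not the backend's code: the two field pairs come from this ticket, while the `FIELD_SWITCHES` table and the `resolve_name_field` helper are hypothetical names. Fields with no entry, such as whatever field the set scope uses, are left unchanged.

```python
# Field pairs named in this ticket: primary-only field -> primary + alternative names.
FIELD_SWITCHES = {
    "referencePrimaryName": "referenceName",
    "agentPrimaryName": "agentName",
}


def resolve_name_field(primary_field: str, include_alternative_names: bool) -> str:
    """Return the field to search (hypothetical helper, for illustration only)."""
    if include_alternative_names:
        # Fields without an entry keep using their original field.
        return FIELD_SWITCHES.get(primary_field, primary_field)
    return primary_field


print(resolve_name_field("agentPrimaryName", include_alternative_names=True))
print(resolve_name_field("agentPrimaryName", include_alternative_names=False))
```

Other scopes presumably follow the same pattern, but their field names are not listed in this ticket, so they are omitted here.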

Environment and versions: Green (as TST) comprised of MarkLogic 11.0.3, Backend v1.15.0, Middle Tier v1.1.18, Frontend v1.25.2, and Dataset produced on 2024-04-18.

Scenario AH of the Perf Test Line Up: our existing dual app server configuration (Scenario J) but with the above-discussed field difference. The last time Scenario J was tested is documented within https://git.yale.edu/lux-its/marklogic/issues/1033 (internal link).

Key metrics we're targeting (column E / scenario J):

Number of application servers: 2 per node. Maximum number of concurrent application server threads:

lux-2: 12 per node for search and related list requests
lux: 6 per node for all other request types
total: 18 per node
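
As a quick sanity check on the numbers above, the per-node thread budget can be written out in a few lines of Python. The dictionary name is hypothetical; the app server names and limits are the ones listed in this ticket.

```python
# Maximum concurrent app server threads per node, as listed above.
MAX_THREADS_PER_NODE = {
    "lux-2": 12,  # search and related list requests
    "lux": 6,     # all other request types
}

total_threads = sum(MAX_THREADS_PER_NODE.values())
search_share = MAX_THREADS_PER_NODE["lux-2"] / total_threads

print(total_threads)           # 18
print(f"{search_share:.0%}")   # 67%
```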

For more information please see the documentation: LUX Performance Testing Procedure

Tasks to complete:

[x] Deploy Backend v1.15.0 with the fullTextSearchRelatedFieldName build property set to referenceName.
[x] In QC, verify /lib/appConstants.mjs includes const FULL_TEXT_SEARCH_RELATED_FIELD_NAME = 'referenceName'.trim();
[x] Disable Green's middle-tier caching.
~- [x] Verify middle tier is configured to use both app servers.~
~- [x] Verify lux-request-group-1 is configured to 6 threads and lux-request-group-2 is configured to 12 threads.~
[x] Verify LUX trace events are enabled plus v8 delay timeout.
[x] Verify no other v8-related trace events are enabled.
[x] Smoke test the front end.
[x] Xinjian: Start collecting OS-level metrics.
[x] Peter: Start collecting middle tier metrics (getMiddleTierStats.sh)
[x] QA: Verify/set ramp-up schedule to 2 simple search VUs, 1 filtered VU, and 1 entity page VU every three minutes until there are 148 users then hold for 15 minutes.
[x] QA: Verify scripts point to GREEN/TST, https://lux-front-tst.collections.yale.edu/
[x] Team: Sign off on the above before proceeding.
[x] QA: Start performance test
[x] QA: Finish performance test
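
The ramp-up schedule in the task list above implies a fixed number of steps and a minimum test length. The Python sketch below works that out. It assumes the first batch of virtual users starts at minute 0, which this ticket does not state, so the actual LoadRunner timing may be offset by one interval. The constant and function names are hypothetical.

```python
# Ramp-up parameters taken from the task list above.
VUS_PER_STEP = {"simple_search": 2, "filtered_search": 1, "entity_page": 1}
STEP_INTERVAL_MIN = 3
TARGET_USERS = 148
HOLD_MIN = 15


def ramp_schedule(first_step_at_min: int = 0) -> list[tuple[int, int]]:
    """Return (minute, total users) after each ramp-up step."""
    step_size = sum(VUS_PER_STEP.values())  # 4 users added per step
    schedule = []
    users = 0
    minute = first_step_at_min  # assumption: first batch starts immediately
    while users < TARGET_USERS:
        users = min(users + step_size, TARGET_USERS)
        schedule.append((minute, users))
        minute += STEP_INTERVAL_MIN
    return schedule


schedule = ramp_schedule()
ramp_end_min = schedule[-1][0]
total_min = ramp_end_min + HOLD_MIN

print(len(schedule))   # 37 steps
print(ramp_end_min)    # 108
print(total_min)       # 123
```

Under that assumption, a full run lasts roughly two hours, which gives context for tests that were aborted after 20 to 30 minutes.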

Data collection (Details from procedure):

[x] Xinjian: Stop collecting OS-level metrics and attach to the ticket
[x] Peter: Stop collecting middle-tier metrics and attach to the ticket
[x] Xinjian: Collect data from AWS and attach to ticket.
~- [ ] TBD: Download the monitoring history (level=raw) and attach to the ticket.~
~- [ ] TBD: Take screenshots of select monitoring history graphs.~
~- [ ] TBD: Collect, trim, and attach backend logs to the ticket.~
~- [ ] TBD: Pull app server queue metrics, attach to the ticket, and record in Perf: Key Metrics.~
~- [ ] TBD: Update online spreadsheet tabs with what is known at this point.~

Revert all configuration changes:

[x] Deploy Backend v1.15.0 with the fullTextSearchRelatedFieldName build property set to referencePrimaryName.
[x] Enable middle-tier caching.
[ ] ~Remove the v8 trace event.~ BH: I think we should leave this enabled, and enable it in Blue.

Verify:

[ ] Smoke test the front end.

Analysis:

[ ] TBD: Upon receipt, review report from QA and update related portions of the online spreadsheet tabs.
[ ] TBD: Mine the backend logs?
[x] TBD: Determine if the test is valid.
[ ] TBD: Determine if the performance is acceptable --> #100.

brent-hartwig commented 1 month ago

@jffcamp, https://github.com/project-lux/lux-marklogic/issues/34 includes Engineering's requests for a future performance test. I am interested in testing without the custom error handler, but there could be middle tier implications that we should first discuss with @gigamorph (fewer 408s and more 500s for timed-out requests). We probably need to pass on that for Thursday's test. This ticket's directions do have us enable the requested v8 trace event, which I think we can leave permanently enabled in Blue and Green.

brent-hartwig commented 1 month ago

@jffcamp, a reminder from our last performance test (internal link):

Despite QA revising the LoadRunner scripts to be more in line with Scenario I's June 2023 test and only four additional v8 engine crashes, today's test comprised significantly fewer requests. While this does not invalidate this particular test (given we just need to get crash info to MarkLogic Support ticket no. 35746), we may want to be aware of it for future comparisons to older tests.

Note too that there would have been a larger ratio of search estimate requests compared to June 2023. Back in June, the middle tier would request multiple estimates in a single backend request. That has since changed to a 1:1 ratio between estimate and backend endpoint requests. The same was not yet true for Search Will Match requests.

xinjianguo commented 1 month ago

OS metrics

cd; cd Apps/LUX/ML
$ ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.156.104
$ ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.157.73
$ ssh -i ch-lux-ssh-prod.pem ec2-user@10.5.254.20

nohup sudo sar -u -r -o /tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S").out 10 >/tmp/sar_${HOSTNAME}_$(date +"%Y-%m-%dT%H%M%S")_screen.out 2>&1 &
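
The command above calls `date` twice, so the two output file names could in principle end up with timestamps one second apart. A minimal Python sketch of the same naming scheme, computing the timestamp once, is shown below. It only builds the command and paths and does not run anything. The `build_sar_command` helper and the `example-host` hostname are hypothetical; the `sar` flags (`-u`, `-r`, `-o`) and the 10-second interval are the ones used above.

```python
from datetime import datetime


def build_sar_command(hostname: str, started_at: datetime, interval_seconds: int = 10):
    """Build the sar argv plus both output paths from a single timestamp."""
    stamp = started_at.strftime("%Y-%m-%dT%H%M%S")
    binary_path = f"/tmp/sar_{hostname}_{stamp}.out"         # sar -o binary data
    screen_path = f"/tmp/sar_{hostname}_{stamp}_screen.out"  # captured stdout/stderr
    argv = ["sudo", "sar", "-u", "-r", "-o", binary_path, str(interval_seconds)]
    return argv, binary_path, screen_path


# "example-host" is a placeholder, not one of the real node names.
argv, binary_path, screen_path = build_sar_command(
    "example-host", datetime(2024, 5, 9, 10, 30, 0)
)
print(" ".join(argv))
print(screen_path)
```

Sharing one timestamp keeps the binary file and its screen log paired by name when the files from all nodes are attached to the ticket.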

[Screenshot: 2024-05-09 at 11:09:01 AM]

Received "MarkLogic green Hung or v8 crash" alert at 11:08:49.

jffcamp commented 1 month ago

Many crashes. Reverted the change and restarted the test.

jffcamp commented 1 month ago

The test in TST with the change reverted also had errors. We then ran the performance test in DEV without the changes and got crashes early. We also noticed a crash at around 10:15 AM, before any tests were performed.

jffcamp commented 1 month ago

Xinjian to open a ticket with ML support to investigate DEV crashes.

brent-hartwig commented 1 month ago

Summary of tests performed as part of this ticket, on 9 May 24...

First Test

The scenario we set out to initially test: TST environment with Green backend resolving name search criteria against primary and alternative names.

The first v8 engine crash occurred after five minutes and more ensued. Given the v8 engine did not crash when Scenario J was tested (and retested) in June 2023, we were concerned the addition of alternative names was causing the issue. The test was aborted after 20 to 30 minutes.

Second Test

Elected to switch back to primary names but keep the rest the same: TST environment with Green backend resolving name search criteria against primary names alone.

The first v8 engine crash also occurred early on. The crashes were slightly less frequent than in the first test but much more frequent than in other scenarios tested. This test was also aborted after 20 to 30 minutes.

We suspected the mixed MarkLogic version environment may be a larger factor than anticipated. Back in Feb 2024, Green and Blue were upgraded to an ML 11.2 nightly build for some testing before bringing ML 11.0.3 back. Because we did not start with a new data directory, there were 11.2 remnants, including an upgraded Security database and journal files that ML 11.0.3 detected and ignored.

Third Test

Elected to move over to DEV, which is considered to be a clean install of ML 11.2.0 GA. This was this project's first performance test against ML 11.2.0. We deployed vanilla Backend v1.15.0, meaning alternative names were not in play. Basically, test 3 was the same as test 2 but in DEV / ML 11.2.0.

We didn't witness a better outcome. Here too, v8 engine crashes occurred early on and persisted. v8 engine crashes were also observed in DEV prior to the performance test.

We opened support ticket no. 36846 requesting Support help diagnose the two sets of v8 engine crashes.
