Performance Test - scheduled for 2024-06-07

brent-hartwig commented 1 month ago

Primary objective: Test the system stability and performance impact of #160. Given previous analysis, we believe #160 will:

Improve system stability when compared with previous 2024 performance tests when instances of the v8 engine crashed early on an frequently. We are looking to get much closer to June 2023's Scenario J performance test when there were zero v8 engine crashes. This was later repeated by Scenario L (another run of Scenario J).
Reduce facet response time, compared to https://git.yale.edu/lux-its/marklogic/issues/1164's original implementation.
Reduce facet response size, compared to https://git.yale.edu/lux-its/marklogic/issues/1164's original implementation.

If available, #161 can be part of this performance test as well. Given the aforementioned analysis was of a search for works, semantic facets were not involved; as such, it is welcome but not required by this test.

Code and Configuration Changes: Required: implementation of #160 and associated frontend and middle tier implementations. Optional: implementation of #161; if included, its associated frontend and middle tier implementations are required.

Environment and versions (update as needed):

Environment: TST which is configured to the GREEN backend.
MarkLogic 11.0.3 (downgraded from 11.2 nightly build)
Backend v1.17.0 + #167 + facet pagination
Middle Tier v1.1.9 + facet pagination
Frontend v1.27.0 + facet pagination
Dataset produced on 2024-05-29

Scenario AK of the Perf Test Line Up: "See if paginating facet requests (#160 and #161) restores system stability by reducing --if not eliminating-- v8 engine crashes."

Backend application server configuration:

lux-request-group-1 on port 8003: The middle tier is expected to send all requests here except search and relatedList requests. Maximum of 6 concurrent requests.
lux-request-group-2 on port 8004: The middle tier is expected to send all search and relatedList requests to this application server. Maximum of 12 concurrent requests.
Maximum of 18 concurrent requests per node.

For more information please see the documentation: LUX Performance Testing Procedure

Tasks:

[x] Confirm the most recent blue-green switch is 100% complete (i.e., no part of TST is using PROD).
[x] #171
[x] Deploy code and/or configuration changes that are being tested.
[x] Disable middle-tier caching in TST
[x] Verify LUX trace events are enabled plus v8 delay timeout.
[x] Smoke test the front end.
[x] Seong: Start collecting OS-level metrics.
[x] Seong: Start collecting middle tier metrics (getMiddleTierStats.sh)
[x] QA: Verify/set ramp-up schedule to 2 simple search VUs, 1 filtered VU, and 1 entity page VU every three minutes until there are 148 users then hold for 15 minutes.
[x] QA: Verify test is configured with three second wait times.
[x] QA: Verify scripts point to TST, https://lux-front-tst.collections.yale.edu/
[x] Team: Sign off on the above before proceeding.
[x] QA: Start performance test
[x] Team: begin monitoring for v8 engine crashes
[x] Team: check total request count at 10 minutes
[x] Team: check total request count at 15 minutes
[x] Team: check total request count at 20 minutes
[x] QA: Finish performance test

Collect Data (Details from procedure):

[x] Seong: Stop collecting OS-level metrics and attach to the ticket
[x] Seong: Stop collecting middle-tier metrics and attach to the ticket
[x] Seong: Collect data from AWS and attach to ticket.
[x] Brent: Download the monitoring history (level=raw) and attach to the ticket.
[x] Brent: Take screenshots of select monitoring history graphs.
[x] Brent: Collect, trim, and attach backend logs to the ticket.
[x] Brent: Pull app server queue metrics, attach to the ticket, and record in Perf: Key Metrics.
[x] Brent: Update online spreadsheet tabs with what is known at this point.

Restore and Verify Environment:

[x] Revert this test's code and configuration changes
[x] Enable middle-tier caching.
[x] Smoke test the front end.

Analyze:

[x] Upon receipt, review report from QA and update related portions of the online spreadsheet tabs.
[x] Mine the backend logs? Yes.
[x] Determine if the test is valid: The test is considered valid, but does come with caveats. See the executive summary.
[x] Determine if the performance is acceptable: again with caveats, performance is considered acceptable. This test was more about determining whether facet request pagination improved system stability, which it did.

Notes:

Test duration: 1:02:08 p.m. - 3:16:30 p.m (Friday, June 7)

brent-hartwig commented 3 weeks ago

The specific backend branch being tested is 161-paginate-via-search-facet-requests. I believe it to be a branch of 160-paginate-non-semantic-facet-requests, which is a branch of release1.18.

In case this test is a raging success and we have time for a second test, we could re-run https://github.com/project-lux/lux-marklogic/issues/132: include alternative names when resolving name search criteria. I created the 161-pagination-PLUS-100-alternative-names branch which is the above plus #100's edits.

cc: @gigamorph, @kamerynB, @prowns, @jffcamp, @clarkepeterf

gigamorph commented 3 weeks ago

Middle tier stats: 20240607-tst-middle-tier-stats.txt

gigamorph commented 3 weeks ago

ML load balancer:

Screenshot 2024-06-07 at 3 46 15 PM

gigamorph commented 3 weeks ago

CPU utilization of ML nodes: Screenshot 2024-06-07 at 4 00 04 PM

504s from ML load balancer: 504-ml

504s from web cache: 504-webcache

brent-hartwig commented 3 weeks ago

Trimmed backend log files:

brent-hartwig commented 3 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: see follow-up in #181.

There were fewer backend requests than expected --yet there are not significant differences between the total number of executed Virtual User (VU) flows/transactions, overall error rate, and average response times with this test's baseline (22 Jun 23) and a more recent test (14 Feb 24). Test duration doesn't explain it either. Comparison:

	J (1033)	AG (1145)	AK	AK vs. J	AK vs. AG
Date	22 Jun 23	14 Feb 24	7 Jun 24	N/A	N/A
App server thread count(s)	6 & 12	18	6 & 12	N/A	N/A
Backend search request count (200 status code)	27,438	12,214	2,758	-90%	-77%
Backend facet request count (200 status code)	82,687	63,367	32,348	-61%	-49%
Total backend request count (all status codes)	1,016,344	688,599	307,511	-70%	-55%
Frontend response count with status code in the 200s	1,046,953	1,046,953	830,098	-21%	-21%
Total frontend response count (all status codes)	1,173,748	1,173,748	1,073,190	-9%	-9%
Backend requests per frontend response	0.87	0.59	0.29	-67%	-51%
Frontend response per backend request	1.15	1.70	3.49	202%	105%
Total middle tier request count (all status codes)	UNK	UNK	363,788	UNK	UNK
Backend requests per middle tier request	UNK	UNK	0.85	UNK	UNK
Middle tier requests per backend request	UNK	UNK	1.18	UNK	UNK
Total number of executed VU flows	28,573	36,729	32,230	3,657	-4,499
Error rate	2.42%	5.42%	4.41%	1.99%	-1.01%
Average, unweighted VU flow response time (secs)	2.262	2.941	3.297	1.035	0.356
Test duration (HH:MM)	1:55*	2:12	2:15	00:20*	00:03

If the system isn't being pushed as hard as previous tests, the results of this test could be skewed. Regardless, today's test was so much better than those on 9 May 24 (#132) and 22 May 24 (#151) as there were zero v8 engine crashes!

Possible explanations thought of to date (not mutually-exclusive):

Fewer backend requests for select frontend requests. Facts diluting this possibility:
- The search estimate endpoint was changed to only accept the criteria for one search per request.
- There were 9.5K to 24.5K fewer frontend searches during this test, based on which test you compare to.
- A single frontend search for works makes 197 backend requests.
- (9.5K * 197 is over 1.8M, which far exceeds 22 Jun 23's total request count of 1.0M)
Performance test changed. There was a decrease in the number of responses given to the VUs. Unknown if that indicates fewer requests we made due to a change(s) to the performance test, the frontend making fewer requests, or a mix of both. The overall frontend decrease of 9% is smaller than the 55% - 70% backend decrease. The ratio between frontend responses and backend requests reduced significantly but the backend is not fulfilling all frontend requests, including images and analytics.
One or more caches between the browser and MarkLogic's load balancer. This comment digs into the number of frontend and middle tier requests more.
What else?

* Logs were trimmed to 1:55. The QA report states the test ran for 2:15. Found a note that read "Test stopped after full ramp up but did not run for the full 15 minutes afterwards." Regardless, for the purposes of this comparison, 22 Jun 23's test running shorter would skew the results to the benefit of 7 Jun 24's test.

brent-hartwig commented 3 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: see follow-up in #181.

@gigamorph noticed early on that there were thousands of 504s (gateway timeouts) from the ML load balancer early on. As shown above, that persisted throughout the test but 504s at the web cache later never exceeded 100 per minute. The likely explanation is that the data service proxies automatically retried the 504s. Looking at this and previous tests' app server metrics, a contributing factor could be a full request queue, for the lux-request-group-1 app server. That app server's queue reported being full 82% of the minutely data points, implying it was maxed out for quite a bit of the test. Given the other app server's queue was not used at all, we should consider redirecting one or more request types to it.

brent-hartwig commented 2 weeks ago

ML Monitoring History

Time period: 1700 - 1920 UTC

CPU:

File IO Detail:

Memory:

Intra-cluster activity, 1 of 2:

Intra-cluster activity, 2 of 2:

Both HTTP app servers:

Just the lux-request-group-1 app server (all requests here except search and relatedList requests, up to 6 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

Just the lux-request-group-2 app server (only search and relatedList requests, up to 12 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

Data node characteristics for the lux-content database alone:

brent-hartwig commented 2 weeks ago

ML monitoring history exports:

network-detail-20240610-130259.xls servers-detail-20240610-130334.xls xdqp-server requests detail-20240610-130107.xls cpu-detail-20240610-124745.xls databases-detail-20240610-131620.xls file-i_o detail-20240610-125223.xls memory-detail-20240610-130034.xls

brent-hartwig commented 2 weeks ago

ML App Server Metrics

The following reflects group one's queue was full for the majority of the test. This could explain many of the 504s the MarkLogic load balancer is returning. Please see this above comment for more details and suggested follow-up.

Script: extractAppServerQueueMetrics.js*

Input: 20240607-1700-1920-app-server-stats.json

Output:

[
  {
    "appServer":"lux-request-group-1",
    "largestQueueSize":36,
    "largestQueueSizeCount":115,
    "greaterThanOneCount":125,
    "dataPointCount":141,
    "totalQueuedRequestCount":4327,
    "values":"omitted per script configuration"
  },
  {
    "appServer":"lux-request-group-2",
    "largestQueueSize":0,
    "largestQueueSizeCount":141,
    "greaterThanOneCount":0,
    "dataPointCount":141,
    "totalQueuedRequestCount":0,
    "values":"omitted per script configuration"
  }
]

* The input includes one instance of a queue size of 37, which exceeds the maximum. I changed the script to use 36 for group one, then added one to that total to account for the single report of 37.

brent-hartwig commented 2 weeks ago

OS-level metrics (via sar):

These were downloaded from Green's nodes but filenames do not reflect nodes 104, 73, 20.

sar_ip-10-5-156-171.its.yale.edu_2024-06-07T165241.out.gz sar_ip-10-5-156-171.its.yale.edu_2024-06-07T165241_screen.out.gz sar_ip-10-5-157-195.its.yale.edu_2024-06-07T165318.out.gz sar_ip-10-5-157-195.its.yale.edu_2024-06-07T165318_screen.out.gz sar_ip-10-5-254-21.its.yale.edu_2024-06-07T165336.out.gz sar_ip-10-5-254-21.its.yale.edu_2024-06-07T165336_screen.out.gz

brent-hartwig commented 2 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: see monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/25.

Looks like two requests failed due to an attempt to send too much data between nodes:

20240607-green-as-test-node-104-ErrorLog-trimmed.txt:2024-06-07 17:43:31.069 Warning: XDQPServerConnection::init: 172.18.94.14 XDMP-XDQPINVREQ: Invalid XDQP request, size=5388703008, client=0, conn=10.5.156.104:7999-172.18.94.14:50564 20240607-green-as-test-node-73-ErrorLog-trimmed.txt:2024-06-07 18:29:36.054 Warning: XDQPServerConnection::init: 172.18.94.12 XDMP-XDQPINVREQ: Invalid XDQP request, size=5388703008, client=0, conn=10.5.157.73:7999-172.18.94.12:48512

Which line up with these entries:

20240607-green-as-test-node-104-8004-ErrorLog-trimmed.txt:2024-06-07 17:43:36.153 Notice: SVC-SOCPEER: Socket peer name error: getpeername 10.5.156.104:8004: Transport endpoint is not connected 20240607-green-as-test-node-73-8004-ErrorLog-trimmed.txt:2024-06-07 18:29:45.069 Notice: SVC-SOCPEER: Socket peer name error: getpeername 10.5.157.73:8004: Transport endpoint is not connected

brent-hartwig commented 2 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: submitted #177

Encountered five JS-FATAL errors. They all match the following pattern, meaning they were all for searchWillMatch requests. These were found in the 8003 application log files, as opposed to the v8 engine crash log entries we're accustomed to finding in the main error logs.

2024-06-07 18:26:59.300 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
2024-06-07 18:26:59.300 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:26:59.300 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:26:59.300 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:26:59.300 Notice: JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07 18:26:59.300 Notice:+in /luxErrorHandler.mjs [javascript]

Nodes and times associated to all instances:

$ grep JS-FATAL 20240607-green-as-test-node-*-800[3-4]-ErrorLog-trimmed.txt
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 17:30:33.370 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 18:32:31.255 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 18:58:53.361 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-20-8003-ErrorLog-trimmed.txt:2024-06-07 18:52:31.780 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-73-8003-ErrorLog-trimmed.txt:2024-06-07 18:26:59.300 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()

cc: @clarkepeterf

brent-hartwig commented 2 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: submitted https://github.com/project-lux/lux-marklogic/issues/178 (low priority)

While there is a JS-BAD instance for each above-discussed JS-FATAL instance, this comment speaks to the other JS-BAD instances. This subset is related to semantic facet requests. Each instance is provided below. Because they all fail after one minute and the request and app server timeouts are both 59 seconds, I believe these are simply timed-out requests. I also checked 22 Jun 23's logs and found four instances; just the "Failed to calculate" log entries are included below, at the end.

The response status code for all was 500, meaning the data service proxies would not have retried them.

2024-06-07 18:27:16.422 Info: [Event:id=LuxFacets] Searching for the 'responsibleUnits' facet values.
2024-06-07 18:27:16.423 Info: [Event:id=LuxSearch] Search 4295481604482034528 parameters: {"searchCriteria":{"_scope":"agent","OR":[{"memberOfInverse":{"AND":[{"curated":{"containingItem":{"text":"booked","_lang":"en"}}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}]}},{"AND":[{"curated":{"containingItem":{"text":"booked","_lang":"en"}}},{"NOT":{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}}]}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:28:15.099 Info: [Event:id=LuxSearch] Search 4295481604482034528 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 44:5 [javascript]\nin ./utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":44,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"./utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:28:15.099 Info: [Event:id=LuxSearch] {"requestId":"4295481604482034528", "requestCompleted":false, "milliseconds":{"total":58676, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"agent", "changedScope":false, "criteria":{"OR":[{"memberOfInverse":{"AND":[{"curated":{"containingItem":{"text":"booked", "_lang":"en"}}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}]}}, {"AND":[{"curated":{"containingItem":{"text":"booked", "_lang":"en"}}}, {"NOT":{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}}]}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Person\", \"Group\"], ['exact']),\n          cts.orQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [crm(\"P107i_is_current_or_former_member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.andQuery([cts.documentQuery(cts\n    .triples(\n      cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist')),\n      [lux(\"agentOfCuration\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      undefined\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"agentTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3\"],\n      [\"exact\"],\n      1\n    )])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.andQuery([cts.documentQuery(cts\n    .triples(\n      cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist')),\n      [lux(\"agentOfCuration\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      undefined\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.notQuery(cts.fieldValueQuery(\n      [\"agentTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3\"],\n      [\"exact\"],\n      1\n    ))])])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:28:15.099 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58690 milliseconds: responsibleUnits
2024-06-07 18:28:15.100 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:28:15.100 Notice:+in [anonymous], at 44:5 [javascript]
2024-06-07 18:28:15.100 Notice:+in ./utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:28:15.100 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:15.102 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}

2024-06-07 18:26:22.219 Info: [Event:id=LuxFacets] Searching for the 'responsibleCollections' facet values.
2024-06-07 18:26:22.219 Info: [Event:id=LuxSearch] Search 14325137610764299934 parameters: {"searchCriteria":{"_scope":"set","AND":[{"containingItem":{"text":"booked","_lang":"en"}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:27:21.160 Info: [Event:id=LuxSearch] Search 14325137610764299934 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 35:5 [javascript]\nin ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":35,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"../utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:27:21.160 Info: [Event:id=LuxSearch] {"requestId":"14325137610764299934", "requestCompleted":false, "milliseconds":{"total":58941, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"set", "changedScope":false, "criteria":{"AND":[{"containingItem":{"text":"booked", "_lang":"en"}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Set\"], ['exact']),\n          cts.andQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"setTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d\"],\n      [\"exact\"],\n      1\n    )])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:27:21.161 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58944 milliseconds: responsibleCollections
2024-06-07 18:27:21.161 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:27:21.161 Notice:+in [anonymous], at 35:5 [javascript]
2024-06-07 18:27:21.161 Notice:+in ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:27:21.161 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:27:21.165 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}

2024-06-07 18:27:21.891 Info: [Event:id=LuxFacets] Searching for the 'responsibleCollections' facet values.
2024-06-07 18:27:21.892 Info: [Event:id=LuxSearch] Search 6675450089992221355 parameters: {"searchCriteria":{"_scope":"set","AND":[{"containingItem":{"text":"booked","_lang":"en"}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:28:21.288 Info: [Event:id=LuxSearch] Search 6675450089992221355 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 35:5 [javascript]\nin ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":35,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"../utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:28:21.288 Info: [Event:id=LuxSearch] {"requestId":"6675450089992221355", "requestCompleted":false, "milliseconds":{"total":59396, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"set", "changedScope":false, "criteria":{"AND":[{"containingItem":{"text":"booked", "_lang":"en"}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Set\"], ['exact']),\n          cts.andQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"setTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d\"],\n      [\"exact\"],\n      1\n    )])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:28:21.289 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 59400 milliseconds: responsibleCollections
2024-06-07 18:28:21.289 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:28:21.289 Notice:+in [anonymous], at 35:5 [javascript]
2024-06-07 18:28:21.289 Notice:+in ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:28:21.289 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:21.292 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}

From 22 Jun 23:

20230622-dev-node-154-8003-ErrorLog-trimmed.txt:2023-06-22 16:39:56.106 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58206 milliseconds: responsibleUnits
20230622-dev-node-237-8003-ErrorLog-trimmed.txt:2023-06-22 16:39:57.087 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58221 milliseconds: responsibleCollections
20230622-dev-node-5-8003-ErrorLog-trimmed.txt:2023-06-22 17:17:21.023 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58536 milliseconds: responsibleCollections
20230622-dev-node-5-8003-ErrorLog-trimmed.txt:2023-06-22 17:17:21.290 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58796 milliseconds: responsibleUnits

cc: @clarkepeterf

brent-hartwig commented 2 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: submitted #177, covering one of two endpoints discussed herein.

Out of the remaining entries non-trace event entries found in the 8003 and 8004 application error logs, there are three JS-JAVASCRIPT instances. They all report Error running JavaScript request: null and --per the stack trace-- fail in either the searchWillMatch of facets data service. These data services have 9 - 14 lines of code. Not sure what would be null that isn't already protected. Perhaps they are failing within their respective library code? Perhaps ML failed to load the code? The associated access and request log entries do not provide additional insights. A bit of a mystery but also only three requests.

2024-06-07 17:27:14.115 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 17:27:14.115 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 17:27:14.115 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 17:27:14.123 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}

2024-06-07 18:14:29.560 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 18:14:29.560 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:14:29.560 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:14:29.563 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}

2024-06-07 18:29:05.201 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 18:29:05.201 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:29:05.201 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:29:05.205 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}

brent-hartwig commented 2 weeks ago

Request response times:

Endpoint	< 10 millis	< 100 millis	< 1 second	< 3 seconds	3-10 seconds	> 10 seconds	Total
Search	2.25%	32.81%	80.93%	100.00%	0.00%	0.00%	2,758
Facets	5.56%	26.83%	88.57%	95.54%	2.79%	1.66%	32,348
Related Lists	0.00%	0.00%	47.31%	92.65%	5.49%	1.87%	911
Search Will Match	0.71%	15.25%	60.79%	82.57%	2.18%	15.25%	22,760
Search Estimate	5.30%	35.37%	98.84%	100.00%	0.00%	0.00%	15,385
Document	97.89%	99.98%	100.00%	100.00%	0.00%	0.00%	169,433
Other endpoints							51,416
Total Requests							295,011

brent-hartwig commented 2 weeks ago

Mined log output:

2024-06-07-perf-test-request-metrics.xlsx. Included separately as this one is more likely to be updated than the others.
other-mined-logs-output.zip

brent-hartwig commented 2 weeks ago

Executive Summary

Facet request pagination improved system stability. There were no v8 engine crashes were recorded in the main error logs.
- Caveat A: There was fewer requests than expected yet without facet request pagination, we couldn't complete the performance smoke test without the v8 engine crashing.
- Caveat B: There were five JS-FATAL instances within the application error logs, which may be a different way v8 engine crashes are reported. All were associated to searchWillMatch requests. We may want to log more information about searchWillMatch requests and/or configure a system monitoring test in Blue/Green. Follow-up tickets: https://github.com/project-lux/lux-marklogic/issues/177 and https://git.yale.edu/lux-its/ml-cluster-formation/issues/19
There were thousands of 504s (gateway timeouts) per minute throughout the test. We believe the data service proxies automatically retried these requests and ended up propping up this test's success. Unknown to what extent this was true in previous tests. We may want to take steps to lighten the load on this facility. Follow-up tickets: https://github.com/project-lux/lux-middletier/issues/67 and #180.
It's still possible for requests to ask for too much data to be passed between ML nodes. Only two such requests during this test. Here too we may want a new system monitoring test to tell us how often this is happening in Blue/Green.
Despite the custom application server error handler being in place and evidence of three timed out semantic facet requests, ML never returned a 408 (timed out). Follow-up ticket: https://github.com/project-lux/lux-marklogic/issues/178
There were three instances when requests failed whereby the logs were light on details. We may just want to check for these in future performance tests.

gigamorph commented 2 weeks ago

From the middle tier application log:

There are some peculiar errors that may be worth noting. The root cause must be some error from in the frontend, but @brent-hartwig - would you say this error "Document is not JSON" was returned by MarkLogic?

failed search estimate for [object Object] - Document is not JSON

There were 1,366 of them.

Also interesting is that there were only(?) five JS-BAD errors and all of them were with the same data:

2024-06-07T18:32:31.256Z error: [121] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T17:30:33.371Z error: [64] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:52:31.782Z error: [80] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:26:59.301Z error: [81] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:58:53.379Z error: [65] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj

And there were 22 errors about retries for 504, e.g.,

failed to get doc for /data/place/067073a4-c71a-4238-8531-4de99805941c?profile=name - call to /ds/lux/document.mjs: retry failed for 504 status after 31 attempts over 121.204 seconds with path:

As previously discussed, this is the behavior of the marklogic client library that we cannot change, but 31 attempts over 121 seconds seems just too much, doesn't quite make sense to me.

brent-hartwig commented 2 weeks ago

[x] Acknowledged, understood, issue submitted, or resolved: see https://github.com/project-lux/lux-marklogic/issues/183 and monitoring test https://git.yale.edu/lux-its/ml-cluster-formation/issues/24.

@gigamorph, thanks for pointing out the 'document is not JSON' errors. I encountered them and thought I chased down the issue but must have forgotten about it. Here's a complete set of entries per instance:

2024-06-07 17:35:41.880 Notice: XDMP-JSONDOC: Document is not JSON
2024-06-07 17:35:41.880 Notice:+in /ds/lux/searchEstimate.mjs [1.0-ml]
2024-06-07 17:35:41.884 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"XDMP-JSONDOC","message":"Document is not JSON"}}

Based the grep results/count of the middle line, all instances occurred within searchEstimate.mjs, leading me to believe something in the performance test is compelling the middle tier to send invalid search criteria. Undoubtedly there are other possibilities but I don't know that the backend logs will help with that determination.

brent-hartwig commented 2 weeks ago

@gigamorph, the five JS-BAD errors that reached the middle tier are a subset of all JS-BAD errors reported in the backend. These particular five are failed searchWillMatch requests. I'm yet to determine why they failed, and here too, the backend logs may not provide additional insights. Please see this comment for the original write-up.

brent-hartwig commented 2 weeks ago

@gigamorph,

As previously discussed, this is the behavior of the marklogic client library that we cannot change, but 31 attempts over 121 seconds seems just too much, doesn't quite make sense to me.

Agreed. Since the backend logs document https://lux.collections.yale.edu/data/place/067073a4-c71a-4238-8531-4de99805941c was successfully found 119 times, I suspect some to all of the "failed to get doc" errors (all URIs) recorded by the middle tier were caused by a MarkLogic application server request queue being full for the majority of the test. The original comment associating 504s to a full app server queue may be found here. And the associated follow-up tickets are https://github.com/project-lux/lux-middletier/issues/67 and https://github.com/project-lux/lux-marklogic/issues/180.

That aside, we have also discussed submitting a support ticket / RFE requesting more control over data service retry behavior. I'm not opposed to doing so but wouldn't use this example. Hopefully the middle tier receives 504s (or 503s) when an app server request queue is full. Such requests should be retried.

One of 119 successfully requests for the above document:

2024-06-07 17:41:43.220 Info: [Event:id=LuxNamedProfiles] Applied the 'name' profile to 'https://lux.collections.yale.edu/data/place/067073a4-c71a-4238-8531-4de99805941c' in 1 milliseconds.

brent-hartwig commented 2 weeks ago

Related to there being fewer than expected backend requests...

Frontend to Backend Request Ratio

During a performance test, how many frontend requests should there be for every backend request? We have never attempted to establish this metric. We would either have to account for requests the frontend makes to other sources (such as for images and analytics) or include them.

When included, this test has 105% to 202% more frontend requests compared to the 14 Feb 24 and 22 Jun 23 tests, respectively. Since SI was introduced in Mar 2024, we can't really compare to tests older than that.

Perhaps we can exclude those other requests. We asked QA if their test logs can tell us how many middle tier requests there were, ideally by status code. (Middle tier requests have paths beginning with "/api/" or "/data/".)

Middle Tier to Backend Request Ratio

@gigamorph was able to determine there were 363,788 middle tier requests during the test via:

SELECT  COUNT(*) FROM lux_cloudfront_tst
WHERE date = DATE '2024-06-07' AND time > '17:02:00' AND time < '19:16:30'
AND (cs_uri_stem LIKE '/api/%' OR cs_uri_stem LIKE '/data/%');

Given most if not all middle tier requests result in multiple backend requests, neither of us would expect more middle tier requests than backend requests, yet that's what the counts look like as there were only 307,511 backend requests. We do not yet know why.

Middle tier request counts by status code:

Status code	Requests	Percent	Requests	Percent
0	66,949	18.40%
200	178,199	48.98%
304	9,700	2.67%
Sub-total: through 300s			254,848	70.05%


400	1,218	0.33%
404	755	0.21%
500	12	0.00%
502	485	0.13%
503	106,465	29.27%
504	5	0.00%
Sub-total: 400s and 500s			108,940	29.95%
Totals	363,788		363,788

Questions:

Were requests with status code 0 okay? I'm leaning towards yes. From https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html (h/t Seong), "...indicates that the viewer closed the connection before the server could respond to the request." This could be an indication of a VU transaction ending and not waiting for the rest of the transaction's responses yet still considering the transaction successful.
Why was 304 (not modified) returned for nearly 10K requests when caching should be disabled? Answered here!
Why was the performance test's error rate only 4.41% when 29.95% of the requests had a status response code in the 400s or 500s?

Sampling of requests with 503 response status codes:

kamerynB commented 2 weeks ago

@brent-hartwig SI was implemented in March. For almost every clickable action, there is an SI event being pushed. I'm not sure if that helps though.

brent-hartwig commented 2 weeks ago

Thanks, @kamerynB. It does. I updated my comment accordingly.

brent-hartwig commented 2 weeks ago

Why was 304 (not modified) returned for nearly 10K requests when caching should be disabled?

Seong researched and provided the following explanation. I agree we do not need to pursue this. In this test, they only accounted for 3% of the middle tier responses.

The node express library that middle tier uses as the web middleware, automatically generates and attaches an etag to each response. ETag is basically hash of the response data that enables the client to determine if the content has changed since the last request. Any well-established client including the web browsers will include an "If-None-Match: ${etag}" header in its requests telling the server to only send the full response (w/ 200) if the etag for the new response is not the previous one that the client(browser) cache currently has. Otherwise the server will just send back the 304 code with an empty body to save the traffic load.

So, that's why you see a lot of 304s in the browser network console. And the thing is, CloudFront and the web cache (Varnish) will each likely do the same thing, so if we need to (I doubt we need), it is not easy to disable all this 304 responses.

brent-hartwig commented 1 week ago

Decisions on executive summary items from conversation with @jffcamp and @prowns today:

No. 1's caveat A and no. 2: By 17 Jun 24, Seong and Xinjian enabled logging on all load balancers (frontend, web cache, middle tier, and MarkLogic in Blue, Green, and Dev). This is expected to provide additional request/response insights by status code.
No. 1's caveat B: Brent to address #177 followed by Xinjian implementing https://git.yale.edu/lux-its/ml-cluster-formation/issues/19.
No. 3: Brent to submit a ticket. We want to know if this is happening in PROD, and details about such requests. TBD if additional logging is needed for such insights versus only knowing it happened.
No. 4: Leave for now.
No. 5: The searchWillMatch instances will be helped by #177. There was also a facets instance, that would not be.

brent-hartwig commented 1 week ago

Closing this ticket as all findings have been "acknowledged, understood, issue submitted, or resolved". For the most part, links were added to #181 comments that continue the associated investigation or links to new tickets were added.

project-lux / lux-marklogic