project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic
Other
3 stars 2 forks source link

Performance Test - scheduled for 2024-06-07 #162

Closed brent-hartwig closed 1 week ago

brent-hartwig commented 1 month ago

Primary objective: Test the system stability and performance impact of #160. Given previous analysis, we believe #160 will:

  1. Improve system stability when compared with previous 2024 performance tests when instances of the v8 engine crashed early on an frequently. We are looking to get much closer to June 2023's Scenario J performance test when there were zero v8 engine crashes. This was later repeated by Scenario L (another run of Scenario J).
  2. Reduce facet response time, compared to https://git.yale.edu/lux-its/marklogic/issues/1164's original implementation.
  3. Reduce facet response size, compared to https://git.yale.edu/lux-its/marklogic/issues/1164's original implementation.

If available, #161 can be part of this performance test as well. Given the aforementioned analysis was of a search for works, semantic facets were not involved; as such, it is welcome but not required by this test.

Code and Configuration Changes: Required: implementation of #160 and associated frontend and middle tier implementations. Optional: implementation of #161; if included, its associated frontend and middle tier implementations are required.

Environment and versions (update as needed):

Scenario AK of the Perf Test Line Up: "See if paginating facet requests (#160 and #161) restores system stability by reducing --if not eliminating-- v8 engine crashes."

Backend application server configuration:

For more information please see the documentation: LUX Performance Testing Procedure

Tasks:

Collect Data (Details from procedure):

Restore and Verify Environment:

Analyze:

Notes:

brent-hartwig commented 3 weeks ago

The specific backend branch being tested is 161-paginate-via-search-facet-requests. I believe it to be a branch of 160-paginate-non-semantic-facet-requests, which is a branch of release1.18.

In case this test is a raging success and we have time for a second test, we could re-run https://github.com/project-lux/lux-marklogic/issues/132: include alternative names when resolving name search criteria. I created the 161-pagination-PLUS-100-alternative-names branch which is the above plus #100's edits.

cc: @gigamorph, @kamerynB, @prowns, @jffcamp, @clarkepeterf

gigamorph commented 3 weeks ago

Middle tier stats: 20240607-tst-middle-tier-stats.txt

gigamorph commented 3 weeks ago

ML load balancer:

Screenshot 2024-06-07 at 3 46 15 PM

gigamorph commented 3 weeks ago

CPU utilization of ML nodes: Screenshot 2024-06-07 at 4 00 04 PM

504s from ML load balancer: 504-ml

504s from web cache: 504-webcache

brent-hartwig commented 3 weeks ago

Trimmed backend log files:

20240607-green-as-test-node-104-8003-AccessLog-trimmed.zip 20240607-green-as-test-node-104-8003-ErrorLog-trimmed.zip 20240607-green-as-test-node-104-8003-RequestLog-trimmed.zip 20240607-green-as-test-node-104-8004-AccessLog-trimmed.zip 20240607-green-as-test-node-104-8004-ErrorLog-trimmed.zip 20240607-green-as-test-node-104-8004-RequestLog-trimmed.zip 20240607-green-as-test-node-104-ErrorLog-trimmed.zip 20240607-green-as-test-node-73-8003-AccessLog-trimmed.zip 20240607-green-as-test-node-73-8003-ErrorLog-trimmed.zip 20240607-green-as-test-node-73-8003-RequestLog-trimmed.zip 20240607-green-as-test-node-73-8004-AccessLog-trimmed.zip 20240607-green-as-test-node-73-8004-ErrorLog-trimmed.zip 20240607-green-as-test-node-73-8004-RequestLog-trimmed.zip 20240607-green-as-test-node-73-ErrorLog-trimmed.zip 20240607-green-as-test-node-20-8003-AccessLog-trimmed.zip 20240607-green-as-test-node-20-8003-ErrorLog-trimmed.zip 20240607-green-as-test-node-20-8003-RequestLog-trimmed.zip 20240607-green-as-test-node-20-8004-AccessLog-trimmed.zip 20240607-green-as-test-node-20-8004-ErrorLog-trimmed.zip 20240607-green-as-test-node-20-8004-RequestLog-trimmed.zip 20240607-green-as-test-node-20-ErrorLog-trimmed.zip

brent-hartwig commented 3 weeks ago

There were fewer backend requests than expected --yet there are not significant differences between the total number of executed Virtual User (VU) flows/transactions, overall error rate, and average response times with this test's baseline (22 Jun 23) and a more recent test (14 Feb 24). Test duration doesn't explain it either. Comparison:

J (1033) AG (1145) AK AK vs. J AK vs. AG
Date 22 Jun 23 14 Feb 24 7 Jun 24 N/A N/A
App server thread count(s) 6 & 12 18 6 & 12 N/A N/A
Backend search request count (200 status code) 27,438 12,214 2,758 -90% -77%
Backend facet request count (200 status code) 82,687 63,367 32,348 -61% -49%
Total backend request count (all status codes) 1,016,344 688,599 307,511 -70% -55%
Frontend response count with status code in the 200s 1,046,953 1,046,953 830,098 -21% -21%
Total frontend response count (all status codes) 1,173,748 1,173,748 1,073,190 -9% -9%
Backend requests per frontend response 0.87 0.59 0.29 -67% -51%
Frontend response per backend request 1.15 1.70 3.49 202% 105%
Total middle tier request count (all status codes) UNK UNK 363,788 UNK UNK
Backend requests per middle tier request UNK UNK 0.85 UNK UNK
Middle tier requests per backend request UNK UNK 1.18 UNK UNK
Total number of executed VU flows 28,573 36,729 32,230 3,657 -4,499
Error rate 2.42% 5.42% 4.41% 1.99% -1.01%
Average, unweighted VU flow response time (secs) 2.262 2.941 3.297 1.035 0.356
Test duration (HH:MM) 1:55* 2:12 2:15 00:20* 00:03

If the system isn't being pushed as hard as previous tests, the results of this test could be skewed. Regardless, today's test was so much better than those on 9 May 24 (#132) and 22 May 24 (#151) as there were zero v8 engine crashes!

Possible explanations thought of to date (not mutually-exclusive):

  1. Fewer backend requests for select frontend requests. Facts diluting this possibility:
    • The search estimate endpoint was changed to only accept the criteria for one search per request.
    • There were 9.5K to 24.5K fewer frontend searches during this test, based on which test you compare to.
    • A single frontend search for works makes 197 backend requests.
    • (9.5K * 197 is over 1.8M, which far exceeds 22 Jun 23's total request count of 1.0M)
  2. Performance test changed. There was a decrease in the number of responses given to the VUs. Unknown if that indicates fewer requests we made due to a change(s) to the performance test, the frontend making fewer requests, or a mix of both. The overall frontend decrease of 9% is smaller than the 55% - 70% backend decrease. The ratio between frontend responses and backend requests reduced significantly but the backend is not fulfilling all frontend requests, including images and analytics.
  3. One or more caches between the browser and MarkLogic's load balancer. This comment digs into the number of frontend and middle tier requests more.
  4. What else?

* Logs were trimmed to 1:55. The QA report states the test ran for 2:15. Found a note that read "Test stopped after full ramp up but did not run for the full 15 minutes afterwards." Regardless, for the purposes of this comparison, 22 Jun 23's test running shorter would skew the results to the benefit of 7 Jun 24's test.

brent-hartwig commented 3 weeks ago

@gigamorph noticed early on that there were thousands of 504s (gateway timeouts) from the ML load balancer early on. As shown above, that persisted throughout the test but 504s at the web cache later never exceeded 100 per minute. The likely explanation is that the data service proxies automatically retried the 504s. Looking at this and previous tests' app server metrics, a contributing factor could be a full request queue, for the lux-request-group-1 app server. That app server's queue reported being full 82% of the minutely data points, implying it was maxed out for quite a bit of the test. Given the other app server's queue was not used at all, we should consider redirecting one or more request types to it.

brent-hartwig commented 2 weeks ago

ML Monitoring History

Time period: 1700 - 1920 UTC

CPU:

image

File IO Detail:

image

Memory:

image

Intra-cluster activity, 1 of 2:

image

Intra-cluster activity, 2 of 2:

image

Both HTTP app servers:

image

Just the lux-request-group-1 app server (all requests here except search and relatedList requests, up to 6 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Just the lux-request-group-2 app server (only search and relatedList requests, up to 12 threads per node):

Note the difference in the y axes for lux-request-group-1 and lux-request-group-2.

image

Data node characteristics for the lux-content database alone:

image

brent-hartwig commented 2 weeks ago

ML monitoring history exports:

network-detail-20240610-130259.xls servers-detail-20240610-130334.xls xdqp-server requests detail-20240610-130107.xls cpu-detail-20240610-124745.xls databases-detail-20240610-131620.xls file-i_o detail-20240610-125223.xls memory-detail-20240610-130034.xls

brent-hartwig commented 2 weeks ago

ML App Server Metrics

The following reflects group one's queue was full for the majority of the test. This could explain many of the 504s the MarkLogic load balancer is returning. Please see this above comment for more details and suggested follow-up.

Script: extractAppServerQueueMetrics.js*

Input: 20240607-1700-1920-app-server-stats.json

Output:

[
  {
    "appServer":"lux-request-group-1",
    "largestQueueSize":36,
    "largestQueueSizeCount":115,
    "greaterThanOneCount":125,
    "dataPointCount":141,
    "totalQueuedRequestCount":4327,
    "values":"omitted per script configuration"
  },
  {
    "appServer":"lux-request-group-2",
    "largestQueueSize":0,
    "largestQueueSizeCount":141,
    "greaterThanOneCount":0,
    "dataPointCount":141,
    "totalQueuedRequestCount":0,
    "values":"omitted per script configuration"
  }
]

* The input includes one instance of a queue size of 37, which exceeds the maximum. I changed the script to use 36 for group one, then added one to that total to account for the single report of 37.

brent-hartwig commented 2 weeks ago

OS-level metrics (via sar):

These were downloaded from Green's nodes but filenames do not reflect nodes 104, 73, 20.

sar_ip-10-5-156-171.its.yale.edu_2024-06-07T165241.out.gz sar_ip-10-5-156-171.its.yale.edu_2024-06-07T165241_screen.out.gz sar_ip-10-5-157-195.its.yale.edu_2024-06-07T165318.out.gz sar_ip-10-5-157-195.its.yale.edu_2024-06-07T165318_screen.out.gz sar_ip-10-5-254-21.its.yale.edu_2024-06-07T165336.out.gz sar_ip-10-5-254-21.its.yale.edu_2024-06-07T165336_screen.out.gz

brent-hartwig commented 2 weeks ago

Looks like two requests failed due to an attempt to send too much data between nodes:

20240607-green-as-test-node-104-ErrorLog-trimmed.txt:2024-06-07 17:43:31.069 Warning: XDQPServerConnection::init: 172.18.94.14 XDMP-XDQPINVREQ: Invalid XDQP request, size=5388703008, client=0, conn=10.5.156.104:7999-172.18.94.14:50564 20240607-green-as-test-node-73-ErrorLog-trimmed.txt:2024-06-07 18:29:36.054 Warning: XDQPServerConnection::init: 172.18.94.12 XDMP-XDQPINVREQ: Invalid XDQP request, size=5388703008, client=0, conn=10.5.157.73:7999-172.18.94.12:48512

Which line up with these entries:

20240607-green-as-test-node-104-8004-ErrorLog-trimmed.txt:2024-06-07 17:43:36.153 Notice: SVC-SOCPEER: Socket peer name error: getpeername 10.5.156.104:8004: Transport endpoint is not connected 20240607-green-as-test-node-73-8004-ErrorLog-trimmed.txt:2024-06-07 18:29:45.069 Notice: SVC-SOCPEER: Socket peer name error: getpeername 10.5.157.73:8004: Transport endpoint is not connected

brent-hartwig commented 2 weeks ago

Encountered five JS-FATAL errors. They all match the following pattern, meaning they were all for searchWillMatch requests. These were found in the 8003 application log files, as opposed to the v8 engine crash log entries we're accustomed to finding in the main error logs.

2024-06-07 18:26:59.300 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
2024-06-07 18:26:59.300 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:26:59.300 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:26:59.300 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:26:59.300 Notice: JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07 18:26:59.300 Notice:+in /luxErrorHandler.mjs [javascript]

Nodes and times associated to all instances:

$ grep JS-FATAL 20240607-green-as-test-node-*-800[3-4]-ErrorLog-trimmed.txt
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 17:30:33.370 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 18:32:31.255 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-104-8003-ErrorLog-trimmed.txt:2024-06-07 18:58:53.361 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-20-8003-ErrorLog-trimmed.txt:2024-06-07 18:52:31.780 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()
20240607-green-as-test-node-73-8003-ErrorLog-trimmed.txt:2024-06-07 18:26:59.300 Notice: JS-FATAL: xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) -- Request has fatal error: v8::Object::SetAlignedPointerInInternalField()

cc: @clarkepeterf

brent-hartwig commented 2 weeks ago

While there is a JS-BAD instance for each above-discussed JS-FATAL instance, this comment speaks to the other JS-BAD instances. This subset is related to semantic facet requests. Each instance is provided below. Because they all fail after one minute and the request and app server timeouts are both 59 seconds, I believe these are simply timed-out requests. I also checked 22 Jun 23's logs and found four instances; just the "Failed to calculate" log entries are included below, at the end.

The response status code for all was 500, meaning the data service proxies would not have retried them.

2024-06-07 18:27:16.422 Info: [Event:id=LuxFacets] Searching for the 'responsibleUnits' facet values.
2024-06-07 18:27:16.423 Info: [Event:id=LuxSearch] Search 4295481604482034528 parameters: {"searchCriteria":{"_scope":"agent","OR":[{"memberOfInverse":{"AND":[{"curated":{"containingItem":{"text":"booked","_lang":"en"}}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}]}},{"AND":[{"curated":{"containingItem":{"text":"booked","_lang":"en"}}},{"NOT":{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}}]}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:28:15.099 Info: [Event:id=LuxSearch] Search 4295481604482034528 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 44:5 [javascript]\nin ./utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":44,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"./utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:28:15.099 Info: [Event:id=LuxSearch] {"requestId":"4295481604482034528", "requestCompleted":false, "milliseconds":{"total":58676, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"agent", "changedScope":false, "criteria":{"OR":[{"memberOfInverse":{"AND":[{"curated":{"containingItem":{"text":"booked", "_lang":"en"}}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}]}}, {"AND":[{"curated":{"containingItem":{"text":"booked", "_lang":"en"}}}, {"NOT":{"classification":{"id":"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3"}}}]}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Person\", \"Group\"], ['exact']),\n          cts.orQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [crm(\"P107i_is_current_or_former_member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.andQuery([cts.documentQuery(cts\n    .triples(\n      cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist')),\n      [lux(\"agentOfCuration\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      undefined\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"agentTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3\"],\n      [\"exact\"],\n      1\n    )])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.andQuery([cts.documentQuery(cts\n    .triples(\n      cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist')),\n      [lux(\"agentOfCuration\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      undefined\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.notQuery(cts.fieldValueQuery(\n      [\"agentTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/5c749df9-4ae0-48ad-b8e3-f120af5689c3\"],\n      [\"exact\"],\n      1\n    ))])])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:28:15.099 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58690 milliseconds: responsibleUnits
2024-06-07 18:28:15.100 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:28:15.100 Notice:+in [anonymous], at 44:5 [javascript]
2024-06-07 18:28:15.100 Notice:+in ./utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:28:15.100 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:28:15.100 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:28:15.100 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:15.100 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:15.102 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}
2024-06-07 18:26:22.219 Info: [Event:id=LuxFacets] Searching for the 'responsibleCollections' facet values.
2024-06-07 18:26:22.219 Info: [Event:id=LuxSearch] Search 14325137610764299934 parameters: {"searchCriteria":{"_scope":"set","AND":[{"containingItem":{"text":"booked","_lang":"en"}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:27:21.160 Info: [Event:id=LuxSearch] Search 14325137610764299934 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 35:5 [javascript]\nin ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":35,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"../utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:27:21.160 Info: [Event:id=LuxSearch] {"requestId":"14325137610764299934", "requestCompleted":false, "milliseconds":{"total":58941, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"set", "changedScope":false, "criteria":{"AND":[{"containingItem":{"text":"booked", "_lang":"en"}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Set\"], ['exact']),\n          cts.andQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"setTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d\"],\n      [\"exact\"],\n      1\n    )])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:27:21.161 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58944 milliseconds: responsibleCollections
2024-06-07 18:27:21.161 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:27:21.161 Notice:+in [anonymous], at 35:5 [javascript]
2024-06-07 18:27:21.161 Notice:+in ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:27:21.161 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:27:21.161 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:27:21.161 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:27:21.161 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:27:21.165 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}
2024-06-07 18:27:21.891 Info: [Event:id=LuxFacets] Searching for the 'responsibleCollections' facet values.
2024-06-07 18:27:21.892 Info: [Event:id=LuxSearch] Search 6675450089992221355 parameters: {"searchCriteria":{"_scope":"set","AND":[{"containingItem":{"text":"booked","_lang":"en"}},{"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]},"searchScope":null,"mayChangeScope":false,"page":1,"pageLength":101,"requestContext":"viaSearchFacet","mayExceedMaximumPageLength":true,"mayEstimate":true,"sortDelimitedStr":"","facetsSoon":false,"synonymsEnabled":false}
...
2024-06-07 18:28:21.288 Info: [Event:id=LuxSearch] Search 6675450089992221355 errored out: {"stack":"JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj\nin [anonymous], at 35:5 [javascript]\nin ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]\nin ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]\nin ./searchLib.mjs, at 178:38, in _search() [javascript]\nin ./searchLib.mjs, at 81:9, in search() [javascript]\nin ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]\nin ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]\nin ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]\nin /ds/lux/facets.mjs, at 12:9 [javascript]\nin ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]\nin /ds/lux/facets.mjs, at 11:0 [javascript]\nin xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"}) [javascript]\nin /ds/lux/facets.mjs [javascript]\nin /ds/lux/facets.mjs [javascript]","mlerr":{},"data":["checkMaybeLocal","wrapValue valObj"],"message":"Unexpected failure","name":"JS-BAD","retryable":false,"stackFrames":[{"uri":"[anonymous]","line":35,"column":5,"operation":".toArray()","language":"javascript"},{"uri":"../utils/utils.mjs","line":111,"column":14,"operation":"evalInContentDatabase","language":"javascript","isEval":false},{"uri":"./SearchCriteriaProcessor.mjs","line":188,"column":17,"operation":"getSearchResults","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":178,"column":38,"operation":"_search","language":"javascript","isEval":false},{"uri":"./searchLib.mjs","line":81,"column":9,"operation":"search","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":313,"column":17,"operation":"_getViaSearchFacetValues","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":269,"column":17,"operation":"_getViaSearchFacets","language":"javascript","isEval":false},{"uri":"../../lib/facetsLib.mjs","line":52,"column":37,"operation":"getFacets","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":12,"column":9,"language":"javascript","isEval":false},{"uri":"../../lib/requestHandleLib.mjs","line":10,"column":9,"operation":"handleRequest","language":"javascript","isEval":false},{"uri":"/ds/lux/facets.mjs","line":11,"column":0,"operation":"xdmp:eval(\"\\nconst op = require(\\\"/MarkLogic/optic\\\");\\nconst crm = op.pref...\", {}, {database:\"7947491399340755119\", update:\"false\"})","language":"javascript","isEval":false}]}
2024-06-07 18:28:21.288 Info: [Event:id=LuxSearch] {"requestId":"6675450089992221355", "requestCompleted":false, "milliseconds":{"total":59396, "process":1}, "estimate":0, "page":1, "pageLength":101, "requestContext":"viaSearchFacet", "returned":0, "scope":"set", "changedScope":false, "criteria":{"AND":[{"containingItem":{"text":"booked", "_lang":"en"}}, {"classification":{"id":"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d"}}]}, "ignoredTerms":[], "query":"\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.prefixer(\"http://www.cidoc-crm.org/cidoc-crm/\");\nconst la = op.prefixer(\"https://linked.art/ns/terms/\");\nconst lux = op.prefixer(\"https://lux.collections.yale.edu/ns/\");\nconst skos = op.prefixer(\"http://www.w3.org/2004/02/skos/core#\");\n        const docs = fn.subsequence(\n          cts.search(cts.andQuery([\n          cts.jsonPropertyValueQuery('dataType', [\"Set\"], ['exact']),\n          cts.andQuery([cts.documentQuery(cts\n    .triples(\n      [],\n      [la(\"member_of\")],\n      [],\n      '=',\n      ['eager', 'concurrent'],\n      cts.orQuery([\n      cts.fieldWordQuery([\"itemAnyText\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1),\n      cts.tripleRangeQuery(\n    [],\n    [lux(\"itemAny\")],\n    fn.insertBefore(\n    cts.values(\n      cts.iriReference(),\n      '',\n      ['eager', 'concurrent'],\n      cts.fieldWordQuery([\"referencePrimaryName\"], \"booked\", [\"case-insensitive\", \"diacritic-insensitive\", \"punctuation-insensitive\", \"whitespace-insensitive\", \"stemmed\", \"wildcarded\"], 1)\n    ), \n    0,\n    sem.iri('/does/not/exist')\n  ), '=', [], 1\n  )\n    ])\n    )\n    .toArray()\n    .map((x) => sem.tripleObject(x))\n    .concat(sem.iri('/does/not/exist'))), cts.fieldValueQuery(\n      [\"setTypeId\"],\n      [\"https://lux.collections.yale.edu/data/concept/2340bd83-22b4-4a39-ae46-ba0258e3a79d\"],\n      [\"exact\"],\n      1\n    )])\n        ]), [\"unfaceted\"]), \n          1, \n          101\n        ).toArray();\n        const results = [];\n        for (const doc of docs) {\n          results.push({id: doc.baseURI, type: doc.xpath('/json/type')});\n        }\n        results;"}
2024-06-07 18:28:21.289 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 59400 milliseconds: responsibleCollections
2024-06-07 18:28:21.289 Notice: JS-BAD: .toArray() -- Unexpected failure: checkMaybeLocal wrapValue valObj
2024-06-07 18:28:21.289 Notice:+in [anonymous], at 35:5 [javascript]
2024-06-07 18:28:21.289 Notice:+in ../utils/utils.mjs, at 111:14, in evalInContentDatabase() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./SearchCriteriaProcessor.mjs, at 188:17, in getSearchResults() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./searchLib.mjs, at 178:38, in _search() [javascript]
2024-06-07 18:28:21.289 Notice:+in ./searchLib.mjs, at 81:9, in search() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 313:17, in _getViaSearchFacetValues() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 269:17, in _getViaSearchFacets() [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/facetsLib.mjs, at 52:37, in getFacets() [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs, at 12:9 [javascript]
2024-06-07 18:28:21.289 Notice:+in ../../lib/requestHandleLib.mjs, at 10:9, in handleRequest() [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs, at 11:0 [javascript]
2024-06-07 18:28:21.289 Notice:+in xdmp:eval("\nconst op = require(\"/MarkLogic/optic\");\nconst crm = op.pref...", {}, {database:"7947491399340755119", update:"false"}) [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:21.289 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:28:21.292 Warning: {"errorResponse":{"statusCode":500,"status":"Internal Server Error","messageCode":"JS-BAD","message":"Unexpected failure"}}

From 22 Jun 23:

20230622-dev-node-154-8003-ErrorLog-trimmed.txt:2023-06-22 16:39:56.106 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58206 milliseconds: responsibleUnits
20230622-dev-node-237-8003-ErrorLog-trimmed.txt:2023-06-22 16:39:57.087 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58221 milliseconds: responsibleCollections
20230622-dev-node-5-8003-ErrorLog-trimmed.txt:2023-06-22 17:17:21.023 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58536 milliseconds: responsibleCollections
20230622-dev-node-5-8003-ErrorLog-trimmed.txt:2023-06-22 17:17:21.290 Info: [Event:id=LuxFacets] Failed to calculate the following facets after 58796 milliseconds: responsibleUnits

cc: @clarkepeterf

brent-hartwig commented 2 weeks ago

Out of the remaining entries non-trace event entries found in the 8003 and 8004 application error logs, there are three JS-JAVASCRIPT instances. They all report Error running JavaScript request: null and --per the stack trace-- fail in either the searchWillMatch of facets data service. These data services have 9 - 14 lines of code. Not sure what would be null that isn't already protected. Perhaps they are failing within their respective library code? Perhaps ML failed to load the code? The associated access and request log entries do not provide additional insights. A bit of a mystery but also only three requests.

2024-06-07 17:27:14.115 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 17:27:14.115 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 17:27:14.115 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 17:27:14.123 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}
2024-06-07 18:14:29.560 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 18:14:29.560 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:14:29.560 Notice:+in /ds/lux/searchWillMatch.mjs [javascript]
2024-06-07 18:14:29.563 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}
2024-06-07 18:29:05.201 Notice: JS-JAVASCRIPT: Error running JavaScript request: null
2024-06-07 18:29:05.201 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:29:05.201 Notice:+in /ds/lux/facets.mjs [javascript]
2024-06-07 18:29:05.205 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"BadRequestError","message":"null"}}
brent-hartwig commented 2 weeks ago

Request response times:

Endpoint < 10 millis < 100 millis < 1 second < 3 seconds 3-10 seconds > 10 seconds Total
Search 2.25% 32.81% 80.93% 100.00% 0.00% 0.00% 2,758
Facets 5.56% 26.83% 88.57% 95.54% 2.79% 1.66% 32,348
Related Lists 0.00% 0.00% 47.31% 92.65% 5.49% 1.87% 911
Search Will Match 0.71% 15.25% 60.79% 82.57% 2.18% 15.25% 22,760
Search Estimate 5.30% 35.37% 98.84% 100.00% 0.00% 0.00% 15,385
Document 97.89% 99.98% 100.00% 100.00% 0.00% 0.00% 169,433
Other endpoints 51,416
Total Requests 295,011
brent-hartwig commented 2 weeks ago

Mined log output:

brent-hartwig commented 2 weeks ago

Executive Summary

  1. Facet request pagination improved system stability. There were no v8 engine crashes were recorded in the main error logs.
  2. There were thousands of 504s (gateway timeouts) per minute throughout the test. We believe the data service proxies automatically retried these requests and ended up propping up this test's success. Unknown to what extent this was true in previous tests. We may want to take steps to lighten the load on this facility. Follow-up tickets: https://github.com/project-lux/lux-middletier/issues/67 and #180.
  3. It's still possible for requests to ask for too much data to be passed between ML nodes. Only two such requests during this test. Here too we may want a new system monitoring test to tell us how often this is happening in Blue/Green.
  4. Despite the custom application server error handler being in place and evidence of three timed out semantic facet requests, ML never returned a 408 (timed out). Follow-up ticket: https://github.com/project-lux/lux-marklogic/issues/178
  5. There were three instances when requests failed whereby the logs were light on details. We may just want to check for these in future performance tests.
gigamorph commented 2 weeks ago

From the middle tier application log:

There are some peculiar errors that may be worth noting. The root cause must be some error from in the frontend, but @brent-hartwig - would you say this error "Document is not JSON" was returned by MarkLogic?

failed search estimate for [object Object] - Document is not JSON

There were 1,366 of them.

Also interesting is that there were only(?) five JS-BAD errors and all of them were with the same data:

2024-06-07T18:32:31.256Z error: [121] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T17:30:33.371Z error: [64] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:52:31.782Z error: [80] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:26:59.301Z error: [81] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj
2024-06-07T18:58:53.379Z error: [65] failed to get doc for /data/set/1cbd9fc8-809a-4d36-9687-73f841c701e2 - JS-BAD: Unexpected failure: checkMaybeLocal wrapObjectNode objnObj

And there were 22 errors about retries for 504, e.g.,

failed to get doc for /data/place/067073a4-c71a-4238-8531-4de99805941c?profile=name - call to /ds/lux/document.mjs: retry failed for 504 status after 31 attempts over 121.204 seconds with path:

As previously discussed, this is the behavior of the marklogic client library that we cannot change, but 31 attempts over 121 seconds seems just too much, doesn't quite make sense to me.

brent-hartwig commented 2 weeks ago

@gigamorph, thanks for pointing out the 'document is not JSON' errors. I encountered them and thought I chased down the issue but must have forgotten about it. Here's a complete set of entries per instance:

2024-06-07 17:35:41.880 Notice: XDMP-JSONDOC: Document is not JSON
2024-06-07 17:35:41.880 Notice:+in /ds/lux/searchEstimate.mjs [1.0-ml]
2024-06-07 17:35:41.884 Warning: {"errorResponse":{"statusCode":400,"status":"Bad Request","messageCode":"XDMP-JSONDOC","message":"Document is not JSON"}}

Based the grep results/count of the middle line, all instances occurred within searchEstimate.mjs, leading me to believe something in the performance test is compelling the middle tier to send invalid search criteria. Undoubtedly there are other possibilities but I don't know that the backend logs will help with that determination.

brent-hartwig commented 2 weeks ago

@gigamorph, the five JS-BAD errors that reached the middle tier are a subset of all JS-BAD errors reported in the backend. These particular five are failed searchWillMatch requests. I'm yet to determine why they failed, and here too, the backend logs may not provide additional insights. Please see this comment for the original write-up.

brent-hartwig commented 2 weeks ago

@gigamorph,

As previously discussed, this is the behavior of the marklogic client library that we cannot change, but 31 attempts over 121 seconds seems just too much, doesn't quite make sense to me.

Agreed. Since the backend logs document https://lux.collections.yale.edu/data/place/067073a4-c71a-4238-8531-4de99805941c was successfully found 119 times, I suspect some to all of the "failed to get doc" errors (all URIs) recorded by the middle tier were caused by a MarkLogic application server request queue being full for the majority of the test. The original comment associating 504s to a full app server queue may be found here. And the associated follow-up tickets are https://github.com/project-lux/lux-middletier/issues/67 and https://github.com/project-lux/lux-marklogic/issues/180.

That aside, we have also discussed submitting a support ticket / RFE requesting more control over data service retry behavior. I'm not opposed to doing so but wouldn't use this example. Hopefully the middle tier receives 504s (or 503s) when an app server request queue is full. Such requests should be retried.

One of 119 successfully requests for the above document:

2024-06-07 17:41:43.220 Info: [Event:id=LuxNamedProfiles] Applied the 'name' profile to 'https://lux.collections.yale.edu/data/place/067073a4-c71a-4238-8531-4de99805941c' in 1 milliseconds.

brent-hartwig commented 2 weeks ago

Related to there being fewer than expected backend requests...

Frontend to Backend Request Ratio

During a performance test, how many frontend requests should there be for every backend request? We have never attempted to establish this metric. We would either have to account for requests the frontend makes to other sources (such as for images and analytics) or include them.

When included, this test has 105% to 202% more frontend requests compared to the 14 Feb 24 and 22 Jun 23 tests, respectively. Since SI was introduced in Mar 2024, we can't really compare to tests older than that.

Perhaps we can exclude those other requests. We asked QA if their test logs can tell us how many middle tier requests there were, ideally by status code. (Middle tier requests have paths beginning with "/api/" or "/data/".)

Middle Tier to Backend Request Ratio

@gigamorph was able to determine there were 363,788 middle tier requests during the test via:

SELECT  COUNT(*) FROM lux_cloudfront_tst
WHERE date = DATE '2024-06-07' AND time > '17:02:00' AND time < '19:16:30'
AND (cs_uri_stem LIKE '/api/%' OR cs_uri_stem LIKE '/data/%');

Given most if not all middle tier requests result in multiple backend requests, neither of us would expect more middle tier requests than backend requests, yet that's what the counts look like as there were only 307,511 backend requests. We do not yet know why.

Middle tier request counts by status code:

Status code Requests Percent Requests Percent
0 66,949 18.40%
200 178,199 48.98%
304 9,700 2.67%
Sub-total: through 300s 254,848 70.05%
400 1,218 0.33%
404 755 0.21%
500 12 0.00%
502 485 0.13%
503 106,465 29.27%
504 5 0.00%
Sub-total: 400s and 500s 108,940 29.95%
Totals 363,788 363,788

Questions:

  1. Were requests with status code 0 okay? I'm leaning towards yes. From https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html (h/t Seong), "...indicates that the viewer closed the connection before the server could respond to the request." This could be an indication of a VU transaction ending and not waiting for the rest of the transaction's responses yet still considering the transaction successful.
  2. Why was 304 (not modified) returned for nearly 10K requests when caching should be disabled? Answered here!
  3. Why was the performance test's error rate only 4.41% when 29.95% of the requests had a status response code in the 400s or 500s?

Sampling of requests with 503 response status codes:

image

kamerynB commented 2 weeks ago

@brent-hartwig SI was implemented in March. For almost every clickable action, there is an SI event being pushed. I'm not sure if that helps though.

brent-hartwig commented 2 weeks ago

Thanks, @kamerynB. It does. I updated my comment accordingly.

brent-hartwig commented 2 weeks ago

Why was 304 (not modified) returned for nearly 10K requests when caching should be disabled?

Seong researched and provided the following explanation. I agree we do not need to pursue this. In this test, they only accounted for 3% of the middle tier responses.

The node express library that middle tier uses as the web middleware, automatically generates and attaches an etag to each response. ETag is basically hash of the response data that enables the client to determine if the content has changed since the last request. Any well-established client including the web browsers will include an "If-None-Match: ${etag}" header in its requests telling the server to only send the full response (w/ 200) if the etag for the new response is not the previous one that the client(browser) cache currently has. Otherwise the server will just send back the 304 code with an empty body to save the traffic load.

So, that's why you see a lot of 304s in the browser network console. And the thing is, CloudFront and the web cache (Varnish) will each likely do the same thing, so if we need to (I doubt we need), it is not easy to disable all this 304 responses.

brent-hartwig commented 1 week ago

Decisions on executive summary items from conversation with @jffcamp and @prowns today:

brent-hartwig commented 1 week ago

Closing this ticket as all findings have been "acknowledged, understood, issue submitted, or resolved". For the most part, links were added to #181 comments that continue the associated investigation or links to new tickets were added.