project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Investigate failures experienced during performance/stress test (from 965) #34

Open gigamorph opened 4 months ago

gigamorph commented 4 months ago

Impacted MarkLogic processes auto-recover, holding up the other nodes for approximately 20 seconds. Multiple retry mechanisms reduce the likelihood of failed requests. The issue does not significantly increase the performance test error rate and is not a current production issue.

In Feb 2024, we ran the performance test against an 11.2 nightly build and were able to reproduce this issue. Logs and other artifacts were provided to Support via support ticket no. 35746. Our ticket: #1165.

Problem Description: During all recent performance tests, the v8 engine crashed multiple times on every node. James Kerr, of MarkLogic Engineering, requests that we report all v8 engine crashes regardless of context. This could be very beneficial, as they may figure out one or more issues while we continue our investigation in parallel.

This ticket is for us to open such a support ticket and serve up the information enabling them to look into the crashes.

James confirmed that a v8 engine crash leads to a MarkLogic segfault; even so, let's keep an eye out for unrelated segfaults.

Expected Behavior/Solution: TBD

Requirements:

Needed for promotion: TBD

If an item on the list is not needed, it should be crossed off but not removed.

UAT/LUX Examples: TBD

Dependencies/Blocks: TBD

Related Github Issues:

Related links:

Wireframe/Mockup: TBD

cc: @jac237, @prowns, @guoxinji

brent-hartwig commented 3 months ago

@jffcamp, @prowns, @xinjianguo, @clarkepeterf,

ML Engineering has paused their efforts on investigating the instance we submitted from Feb 2024's performance test. They stand a better chance of tracking down the cause in more controlled or isolated conditions. It is not practical for Engineering to replicate the LoadRunner performance test; thus, we're going for more isolated conditions, which is where the new monitoring test for messages containing "hung" comes in. Should that test alert, let's learn what we can about the conditions and pass on the logs via Support ticket no. 35746. Also, the next time we run the performance test, we should consider doing so without the custom error handler (#89).
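
For anyone who wants to check a log excerpt by hand before the monitoring test alerts, here is a minimal sketch of the kind of scan described above. It is not the actual monitoring test: the function name `find_hung_messages`, the case-insensitive substring match, and the sample lines are all assumptions for illustration, and the sample lines are not real MarkLogic log output.

```python
from typing import Iterable, List, Tuple


def find_hung_messages(lines: Iterable[str], needle: str = "hung") -> List[Tuple[int, str]]:
    """Return (line_number, text) pairs for lines containing `needle`.

    Hypothetical helper: matching is a plain case-insensitive substring
    test, so it will also match words such as "hungry".
    """
    needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only; not real log output.
    sample = [
        "info: request completed\n",
        "warning: thread appears Hung for 20 seconds\n",
        "info: request completed\n",
    ]
    for number, text in find_hung_messages(sample):
        print(f"line {number}: {text}")
```

The same function can be pointed at an open log file object, since a file iterates line by line. The line numbers it returns make it easier to pull the surrounding context to attach to the support ticket.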

brent-hartwig commented 2 months ago

Additional requests from ML Engineering for the next performance test: