Improve backend request distribution

brent-hartwig commented 3 months ago

> [!NOTE]
> The first attempt at this ticket was implemented in #78, when the middle tier was updated to send /advancedSearchConfig and /dataConstants requests to request-group-2. The change was made to help performance test results. It is expected to neither help nor hurt in production, as these requests will not reach the middle tier due to the webcache.

Problem Description: During performance tests, one of the two application server request queues can be full for a sustained period, which likely explains the thousands of 504 responses per second from the MarkLogic load balancer, which we believe the data service proxies are retrying. We believe the data service proxies are mostly successful since 504s at the web cache load balancer peaked around 100 per second. If we redirect some requests from the application server whose queue fills up to the one whose queue doesn't, the stack may be able to service more requests in a shorter period of time and have fewer 504s to process.

During 7 Jun 24's performance test, one queue was full 82% of the time the metric was recorded.

Expected Behavior/Solution: Increase traffic to the apparently underutilized MarkLogic application server. How best to do this is yet to be determined. Ideas and considerations thought of thus far (not intended to be mutually-exclusive):

1. Question whether the objective is to cater to the current version of the performance test or to come closer to present-day production usage patterns. Additional analysis may be required for the latter, but could be informed by frontend analytics and production's backend logs.
2. During 7 Jun 24's performance test, CPU utilization was sustained at an elevated level. We have some room, but not a ton.
3. Question whether requests should continue to be routed by request type and, if so, which request type(s) will be changing application servers.
4. Question whether the maximum number of threads should change. Perhaps some should shift from one app server to another. The maximum number of threads factors into the maximum queue size: maximum per node × number of nodes × 2.
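To make the queue-size relationship in the thread-count consideration concrete, here is a minimal sketch. It assumes the formula reads as maximum threads per node times number of nodes times 2, and that the per-app-server maximums listed under the current configuration (6 and 12) are per-node values. The node count of 3 and the function name are hypothetical; the issue does not state the cluster size.

```python
def max_queue_size(max_threads_per_node: int, node_count: int) -> int:
    """Maximum queue size, assumed to be threads per node x nodes x 2."""
    return max_threads_per_node * node_count * 2


# Hypothetical cluster size; the issue does not state the number of nodes.
NODE_COUNT = 3

# Thread maximums from the current configuration (Scenario J),
# assumed here to be per-node values.
scenario_j = {"lux-request-group-1": 6, "lux-request-group-2": 12}

for app_server, threads in scenario_j.items():
    print(app_server, max_queue_size(threads, NODE_COUNT))
```

Under these assumptions, shifting threads from one app server to the other moves queue capacity along with them, which is why the two settings are worth considering together.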

Note we have tried nos. 3 and 4 before, specifically scenarios K, M, N, and P. We did not end up selecting any of those configurations; however, much has changed since then.

Current configuration (Scenario J):

- lux-request-group-1 on port 8003: The middle tier is expected to send all requests here except search and relatedList requests. Maximum of 6 concurrent requests.
- lux-request-group-2 on port 8004: The middle tier is expected to send all search and relatedList requests to this application server. Maximum of 12 concurrent requests.
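The routing rule in Scenario J can be sketched as a simple lookup. This is an illustration only, not the middle tier's actual code: the request-type labels and the function name are hypothetical, and only the two ports and the search/relatedList split come from the description above.

```python
# Ports from the current configuration (Scenario J).
REQUEST_GROUP_1_PORT = 8003
REQUEST_GROUP_2_PORT = 8004

# Request types sent to lux-request-group-2; everything else goes to
# lux-request-group-1. The string labels are hypothetical.
GROUP_2_REQUEST_TYPES = frozenset({"search", "relatedList"})


def port_for(request_type: str) -> int:
    """Return the backend app server port for a given request type."""
    if request_type in GROUP_2_REQUEST_TYPES:
        return REQUEST_GROUP_2_PORT
    return REQUEST_GROUP_1_PORT


for request_type in ("search", "relatedList", "facets", "document"):
    print(request_type, port_for(request_type))
```

With a rule of this shape, moving a request type to the other app server is a one-line change to the set, and a per-endpoint check of the returned port is one way to confirm each endpoint reaches the intended app server.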

Breakdown of request by type and duration from the 7 Jun 24 performance test (https://github.com/project-lux/lux-marklogic/issues/162):

| Endpoint | < 10 millis | < 100 millis | < 1 second | < 3 seconds | 3-10 seconds | > 10 seconds | lux-request-group-1 | lux-request-group-2 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Search | 2.25% | 32.81% | 80.93% | 100.00% | 0.00% | 0.00% | | 2,758 | 2,758 |
| Related Lists | 0.00% | 0.00% | 47.31% | 92.65% | 5.49% | 1.87% | | 911 | 911 |
| Facets | 5.56% | 26.83% | 88.57% | 95.54% | 2.79% | 1.66% | 32,348 | | 32,348 |
| Search Will Match | 0.71% | 15.25% | 60.79% | 82.57% | 2.18% | 15.25% | 22,760 | | 22,760 |
| Search Estimate | 5.30% | 35.37% | 98.84% | 100.00% | 0.00% | 0.00% | 15,385 | | 15,385 |
| Document | 97.89% | 99.98% | 100.00% | 100.00% | 0.00% | 0.00% | 169,433 | | 169,433 |
| Other endpoints | | | | | | | 51,416 | | 51,416 |
| Total Requests | | | | | | | 291,342 | 3,669 | 295,011 |
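To put the request counts in perspective, the following sketch computes the share of requests each app server received and what that share would become if one endpoint moved. The counts come from the table; moving Facets is a hypothetical example, not a recommendation, and the helper names are made up for illustration. Request counts alone do not capture load, since durations differ widely by endpoint.

```python
# Request counts per endpoint from the 7 Jun 24 performance test table.
GROUP_1 = {
    "Facets": 32_348,
    "Search Will Match": 22_760,
    "Search Estimate": 15_385,
    "Document": 169_433,
    "Other endpoints": 51_416,
}
GROUP_2 = {"Search": 2_758, "Related Lists": 911}


def group_2_share(group_1: dict, group_2: dict) -> float:
    """Fraction of all requests handled by lux-request-group-2."""
    total = sum(group_1.values()) + sum(group_2.values())
    return sum(group_2.values()) / total


def move_endpoint(endpoint: str, group_1: dict, group_2: dict):
    """Return copies of both groups with one endpoint moved from 1 to 2."""
    new_1, new_2 = dict(group_1), dict(group_2)
    new_2[endpoint] = new_1.pop(endpoint)
    return new_1, new_2


print(f"current: {group_2_share(GROUP_1, GROUP_2):.2%}")  # 1.24%

# Hypothetical what-if: route Facets to lux-request-group-2.
moved_1, moved_2 = move_endpoint("Facets", GROUP_1, GROUP_2)
print(f"after moving Facets: {group_2_share(moved_1, moved_2):.2%}")  # 12.21%
```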

Note there were fewer than expected backend requests during the 7 Jun 24 performance test, and we may not yet know why.

Requirements: See above.

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

~- [ ] Wireframe/Mockup - Mike~
- [ ] Committee discussions - Sarah
- [ ] Feasibility/Team discussion - Sarah
~- [ ] Backend requirements - None~
~- [ ] Frontend requirements - None~
- [ ] Are new regression tests required for QA - Amy
- [ ] Determine if performance test should be run.
- [ ] Questions
  List of questions for discussions. Answers should be documented within the issue.

UAT/LUX Examples: All endpoints should be tested to ensure they reach the intended backend application server. Performance test recommended as well.

Dependencies/Blocks: This issue is neither dependent on nor blocking another.

Related Github Issues: https://github.com/project-lux/lux-marklogic/issues/162

Related links: None.

Wireframe/Mockup: Not needed.

brent-hartwig commented 3 months ago

FYI @roamye, @clarkepeterf, @gigamorph

brent-hartwig commented 2 months ago

Based on a team conversation today, the decision on whether to implement this ticket is blocked on upcoming discussions on how to perform a performance test. cc: @jffcamp, @prowns, @roamye

roamye commented 1 month ago

@brent-hartwig - is this blocked by the discussion or something else? This also needs to go to prioritization review, as it skipped over the UAT process.

brent-hartwig commented 1 month ago

@roamye, it is blocked pending a discussion / decision. My vote is to just close this ticket. Only after we figure out how the performance test can be executed to better represent production (e.g., more varied data with middle tier cache enabled) will we be able to identify and prioritize bottlenecks. Those bottlenecks may or may not be removed by changing how the middle tier utilizes the two backend app servers. Further, due to facet request pagination, ML 11.3 improvements, and the fact that the backend is never pushed as hard in production as it is during a performance test, we may be able to go back to one app server, which would simplify the environment a little.

cc: @clarkepeterf, @jffcamp, @prowns

roamye commented 1 month ago

UAT 8/26: This will be brought up in the performance test discussion this Wednesday. Will determine if this is closed/opened then.

@prowns to add to agenda.

roamye commented 3 weeks ago

@prowns -

I looked over the performance test discussion and this was not part of the agenda. It is unclear whether this ticket should remain open or closed. Should this be brought up in the IT Team Meeting to discuss?

brent-hartwig commented 3 weeks ago

I propose we close this ticket based on our intent to revamp the performance test and only then seek out bottlenecks. We can reopen this ticket if request distribution turns out to be one of the bottlenecks.

clarkepeterf commented 2 weeks ago

I concur with @brent-hartwig.

jffcamp commented 2 weeks ago

As do I


roamye commented 2 weeks ago

Approved by UAT 9/12 to close.

project-lux / lux-middletier

Improve backend request distribution #67