project-lux / lux-middletier

Node backend for LUX frontend (a.k.a. middle tier)
Apache License 2.0

Improve backend request distribution #67

Closed brent-hartwig closed 2 weeks ago

brent-hartwig commented 3 months ago

> [!NOTE]
> The first attempt at this ticket was implemented in #78, when the middle tier was updated to send /advancedSearchConfig and /dataConstants requests to request-group-2. The change was made to help performance test results. The change is expected to neither help nor hurt in production, as these requests will not reach the middle tier due to the web cache.

Problem Description: During performance tests, one of the two application server request queues can be full for a sustained period, which likely explains the thousands of 504 responses per second from the MarkLogic load balancer, which we believe the data service proxies are retrying. We believe the data service proxies are mostly successful, since the web cache load balancer peaked around 100 per second. If we redirect some requests from the application server whose queue fills up to the one whose queue doesn't, the stack may be able to service more requests in a shorter period of time and have fewer 504s to process.

During 7 Jun 24's performance test, one queue was full 82% of the time the metric was recorded:

image

Expected Behavior/Solution: Increase traffic to the apparently underutilized MarkLogic application server. How best to do this is yet to be determined. Ideas and considerations thought of thus far (not intended to be mutually exclusive):

  1. Question whether the objective is to cater to the current version of the performance test or to something closer to present-day production usage patterns. Additional analysis may be required for the latter, but could be informed by frontend analytics and production's backend logs.
  2. During 7 Jun 24's performance test, CPU utilization was sustained at an elevated level. We have some room, but not a ton:

image

  3. Question whether requests should continue to be routed by request type and, if so, which request type(s) will be changing application servers.
  4. Question whether the maximum number of threads should change. Perhaps some should shift from one app server to another. The maximum number of threads factors into the maximum queue size: maximum per node × number of nodes × 2.

Note we have tried nos. 3 and 4 before, specifically scenarios K, M, N, and P. We did not end up selecting those configurations; however, much has changed since then.
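To make the thread/queue relationship from the list above concrete, here is a minimal arithmetic sketch. It assumes the three terms in that item are multiplied together (maximum per node, number of nodes, 2). The node count and thread counts are hypothetical and are not taken from Scenario J or any other scenario; `maxQueueSize` and `shiftThreads` are illustrative names only.

```typescript
// Worked arithmetic only. All numbers below are hypothetical.
interface AppServerThreads {
  group1: number; // max threads per node on the first app server
  group2: number; // max threads per node on the second app server
}

// Assumed reading of the ticket: maximum per node x number of nodes x 2.
function maxQueueSize(maxThreadsPerNode: number, nodeCount: number): number {
  return maxThreadsPerNode * nodeCount * 2;
}

// Move `count` threads per node from group 1 to group 2, keeping the sum fixed.
function shiftThreads(threads: AppServerThreads, count: number): AppServerThreads {
  if (count < 0 || count > threads.group1) {
    throw new RangeError("count must be between 0 and the group 1 thread count");
  }
  return { group1: threads.group1 - count, group2: threads.group2 + count };
}

const clusterNodes = 3; // hypothetical
const currentThreads: AppServerThreads = { group1: 24, group2: 8 }; // hypothetical
const shiftedThreads = shiftThreads(currentThreads, 8);

const currentQueues = {
  group1: maxQueueSize(currentThreads.group1, clusterNodes), // 24 * 3 * 2 = 144
  group2: maxQueueSize(currentThreads.group2, clusterNodes), // 8 * 3 * 2 = 48
};
const shiftedQueues = {
  group1: maxQueueSize(shiftedThreads.group1, clusterNodes), // 16 * 3 * 2 = 96
  group2: maxQueueSize(shiftedThreads.group2, clusterNodes), // 16 * 3 * 2 = 96
};
```

Under this reading, shifting threads moves queue capacity between the two app servers without changing the combined capacity, so it only helps if the receiving server has headroom.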

Current configuration (Scenario J):

Breakdown of requests by type and duration from the 7 Jun 24 performance test (https://github.com/project-lux/lux-marklogic/issues/162):

| Endpoint | < 10 millis | < 100 millis | < 1 second | < 3 seconds | 3-10 seconds | > 10 seconds | lux-request-group-1 | lux-request-group-2 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Search | 2.25% | 32.81% | 80.93% | 100.00% | 0.00% | 0.00% | | 2,758 | 2,758 |
| Related Lists | 0.00% | 0.00% | 47.31% | 92.65% | 5.49% | 1.87% | | 911 | 911 |
| Facets | 5.56% | 26.83% | 88.57% | 95.54% | 2.79% | 1.66% | 32,348 | | 32,348 |
| Search Will Match | 0.71% | 15.25% | 60.79% | 82.57% | 2.18% | 15.25% | 22,760 | | 22,760 |
| Search Estimate | 5.30% | 35.37% | 98.84% | 100.00% | 0.00% | 0.00% | 15,385 | | 15,385 |
| Document | 97.89% | 99.98% | 100.00% | 100.00% | 0.00% | 0.00% | 169,433 | | 169,433 |
| Other endpoints | | | | | | | 51,416 | | 51,416 |
| Total Requests | | | | | | | 291,342 | 3,669 | 295,011 |
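The split between the two request groups can be recomputed from the per-endpoint counts in the table above. The sketch below assumes each count belongs to the group implied by the column totals (Search and Related Lists sum to 3,669; the remaining rows sum to 291,342). The `reassign` what-if is purely illustrative and is not a proposal made in this ticket; all type and function names are hypothetical.

```typescript
type RequestGroup = "lux-request-group-1" | "lux-request-group-2";

interface EndpointLoad {
  endpoint: string;
  requests: number;
  group: RequestGroup; // group assignment inferred from the column totals
}

// Counts from the 7 Jun 24 performance test table.
const observedLoads: EndpointLoad[] = [
  { endpoint: "Search", requests: 2758, group: "lux-request-group-2" },
  { endpoint: "Related Lists", requests: 911, group: "lux-request-group-2" },
  { endpoint: "Facets", requests: 32348, group: "lux-request-group-1" },
  { endpoint: "Search Will Match", requests: 22760, group: "lux-request-group-1" },
  { endpoint: "Search Estimate", requests: 15385, group: "lux-request-group-1" },
  { endpoint: "Document", requests: 169433, group: "lux-request-group-1" },
  { endpoint: "Other endpoints", requests: 51416, group: "lux-request-group-1" },
];

// Sum request counts per request group.
function totalsByGroup(loads: EndpointLoad[]): Record<RequestGroup, number> {
  const totals: Record<RequestGroup, number> = {
    "lux-request-group-1": 0,
    "lux-request-group-2": 0,
  };
  for (const load of loads) {
    totals[load.group] += load.requests;
  }
  return totals;
}

// Return a copy of the loads with one endpoint moved to another group.
function reassign(loads: EndpointLoad[], endpoint: string, group: RequestGroup): EndpointLoad[] {
  return loads.map((load) => (load.endpoint === endpoint ? { ...load, group } : load));
}

const observedTotals = totalsByGroup(observedLoads);
const groupTwoShare =
  observedTotals["lux-request-group-2"] /
  (observedTotals["lux-request-group-1"] + observedTotals["lux-request-group-2"]);

// Illustrative what-if only: move one endpoint and recompute the totals.
const whatIfTotals = totalsByGroup(
  reassign(observedLoads, "Search Will Match", "lux-request-group-2"),
);
```

By request count, lux-request-group-2 received roughly 1.24% of the traffic. Counts alone understate load, though: the duration columns show that some endpoints (for example Search Will Match, with 15.25% of requests over 10 seconds) hold threads far longer than others.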

Note there were fewer than expected backend requests during the 7 Jun 24 performance test, and we may not yet know why.

Requirements: See above.

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

~- [ ] Wireframe/Mockup - Mike~

UAT/LUX Examples: All endpoints should be tested to ensure they reach the intended backend application server. Performance test recommended as well.
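One way to check that endpoints reach the intended app server is a table-driven test over the routing decision. The following is a minimal sketch, not the middle tier's actual code: `resolveGroup`, `groupTwoPaths`, and `findMisroutedPaths` are hypothetical names, the group names are borrowed from the table columns, and treating every other path as lux-request-group-1 is an assumption. Only /advancedSearchConfig and /dataConstants come from this ticket; a real table would also list the search and related list endpoints.

```typescript
type BackendGroup = "lux-request-group-1" | "lux-request-group-2";

// Hypothetical routing table. Only these two paths are named in the ticket (#78).
const groupTwoPaths: ReadonlySet<string> = new Set([
  "/advancedSearchConfig",
  "/dataConstants",
]);

// Decide which request group a path goes to, ignoring any query string.
// Assumption: anything not listed above defaults to lux-request-group-1.
function resolveGroup(path: string): BackendGroup {
  const queryStart = path.indexOf("?");
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);
  return groupTwoPaths.has(pathname) ? "lux-request-group-2" : "lux-request-group-1";
}

interface RoutingExpectation {
  path: string;
  expected: BackendGroup;
}

// Return the paths whose resolved group differs from the expected one.
function findMisroutedPaths(expectations: RoutingExpectation[]): string[] {
  return expectations
    .filter((expectation) => resolveGroup(expectation.path) !== expectation.expected)
    .map((expectation) => expectation.path);
}

const routingExpectations: RoutingExpectation[] = [
  { path: "/advancedSearchConfig", expected: "lux-request-group-2" },
  { path: "/dataConstants?refresh=true", expected: "lux-request-group-2" }, // hypothetical query string
  { path: "/hypothetical/other/endpoint", expected: "lux-request-group-1" }, // hypothetical path
];

const misroutedPaths = findMisroutedPaths(routingExpectations);
```

An empty `misroutedPaths` array means every listed path resolved to its expected group. This only verifies the routing decision; confirming that requests actually arrive at the intended app server would still need the backend logs or a performance test, as noted above.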

Dependencies/Blocks: This issue is neither dependent on nor blocking another.

Related Github Issues: https://github.com/project-lux/lux-marklogic/issues/162

Related links: None.

Wireframe/Mockup: Not needed.

brent-hartwig commented 3 months ago

FYI @roamye, @clarkepeterf, @gigamorph

brent-hartwig commented 2 months ago

Based on a team conversation today, the decision on whether to implement this ticket is blocked on upcoming discussions on how to perform a performance test. cc: @jffcamp, @prowns, @roamye

roamye commented 1 month ago

@brent-hartwig - Is this blocked by the discussion or something else? This also needs to go to prioritization review, as it skipped the UAT process.

brent-hartwig commented 1 month ago

@roamye, it is blocked pending a discussion / decision. My vote is to just close this ticket. Only after we figure out how the performance test can be executed to better represent production (e.g., more varied data with middle tier cache enabled) will we be able to identify and prioritize bottlenecks. Those bottlenecks may or may not be removed by changing how the middle tier utilizes the two backend app servers. Further, due to facet request pagination, ML 11.3 improvements, and the fact that the backend is never pushed as hard in production as it is during a performance test, we may be able to go back to one app server, which would simplify the environment a little.

cc: @clarkepeterf, @jffcamp, @prowns

roamye commented 1 month ago

UAT 8/26: This will be brought up in the performance test discussion this Wednesday. Will determine if this is closed/opened then.

@prowns to add to agenda.

roamye commented 3 weeks ago

@prowns -

I looked over the performance test discussion and this was not part of the agenda. It is unclear whether this ticket should remain open or closed. Should this be brought up in the IT Team Meeting to discuss?

brent-hartwig commented 3 weeks ago

I propose closing this ticket based on our intent to revamp the performance test and only then seek out bottlenecks. We can reopen this ticket if it turns out to be one of the bottlenecks.

clarkepeterf commented 2 weeks ago

I concur with @brent-hartwig.

jffcamp commented 2 weeks ago

As do I

roamye commented 2 weeks ago

Approved by UAT 9/12 to close.