plingam-infy closed this issue 1 year ago.
@plingam-infy thanks for the detailed writeup. I had a few questions:
While the Worker Task Pattern may not address all of the above issues, it does alleviate them, and it helps with scalability and performance by not needing long timeouts, caching, etc.
I want to understand what scalability and performance requirements we are talking about here. Have you already tried the existing short/long polling solutions and found scalability and performance issues for your use case? Some concrete numbers would help us make improvements to the existing mechanisms.
The trade-off of the Worker Task Pattern is that the OPA client may not be able to get real-time policy updates as with long polling.
One of the advantages of long polling in combination with delta bundles is the fast propagation of data, so it looks like that may not be feasible in this approach.
OPA Client can invoke Bundle Service API as POST
This would need changes in the bundle service as well. We've tried to keep the bundle service API very simple so it's easy to set it up and start serving bundles to OPA.
The main point you raise is scalability and performance limitations of the current approach and it's possible that Worker Task Pattern may help with that. The changes would be pretty significant so we really need to understand the shortcomings of the current approach in terms of real numbers and also clearly show how the Worker Task Pattern actually improves on that.
One thing I can imagine doing to test this out is to create a component that implements the Worker Task Pattern client and server, to consume and serve bundles respectively, with the client then using OPA's REST API to push data/policy into OPA. You could even do this today without using long/short polling.
Hi @ashutosh-narkar ,
I have not implemented polling in OPA. In a typical large-scale enterprise system, we have to take into account all intermediaries such as WAFs, CDNs, reverse proxies, and API proxies. Long-polling connections could result in timeouts (even if the timeout is extended to the maximum, which the intermediaries don't prefer). Short polling wastes resources in connection setup. A timeout will also result in a 200 with long polling, AFAIK. Further, you cannot set a Cache-Control header with a suitable value for long polling.
Delta bundles are only for 'data' JSON, not for 'policy' bundles (AFAIK). I said the initial request is a POST; only the subsequent worker-resource requests are GETs.
An in-memory queue can be attached to the bundle server; the worker picks items from it and runs them as a separate process on the server (enough context must be present to process the request: from the client's perspective it is fire-and-forget, but the server is aware of the parent POST request). Even if the policy comes from multiple sources, keep the in-memory queue, or move the queue to a separate component outside the bundle server. Aggregation of all the worker responses (GET) should be done on the application side.
If this issue arises with polling, you can consider my feature request (Worker Task Pattern) or reach out to me in the future. Thanks.
Thanks for the input @plingam-infy. It's good to keep this option in mind if we see significant scalability and performance limitations with the current mechanisms. Thanks for bringing this up!
Prakash Lingam: OPA invokes Bundle Service API to pull policy bundles.
Bundle download is implemented in Go as a plugin in OPA. Periodic download of bundles (short polling) is implemented in a loop that listens for a timer to elapse (configurable). When the timer elapses, the bundle download happens in the background. Depending on client needs, this could result in an increased polling frequency: the lower the latency acceptable to the client, the higher the polling frequency, and the more resources consumed on the server/network side. Long polling addresses these latency and resource issues, but not necessarily at scale. Long polling makes an HTTP request to a server and keeps the TCP/IP connection open until the server responds; the server responds with an update or a timeout, as applicable. I have not worked on long polling with OPA, but my experience with long polling is empirical. The performance and scalability limitations of long polling are well known: reverse proxies, gateways, and WAFs cannot easily scale when there are many long-polling connections.

Question to ponder: if we have to deploy OPA instances at scale, behind software infrastructure like reverse proxies, gateways, and WAFs, can OPA scale with long polling (especially when there are many bundles to download)?

Go has libraries that support long polling, and OPA (being implemented in Go) leverages them for policy bundle download. The OPA documentation references https://datatracker.ietf.org/doc/html/rfc6202#section-2.2 , which clearly states the long-polling issues above. One more link, though it is for another platform: https://ably.com/topic/long-polling
What is the underlying problem you're trying to solve?
Prakash Lingam: HTTP long-polling issues are summarized below:

i. During connection setup, the request goes through intermediaries such as gateways and proxies, which normally aid scalability by enabling load balancing, caching, monitoring, SSL handling, etc. Maintaining an open connection across these intermediaries therefore becomes a scalability issue.
ii. Though proxies support long polling, not all proxies can buffer the response on behalf of the server for the client, and maintaining open connections incurs resource overhead (CPU, memory, network bandwidth) across all intermediaries. These factors lead to a performance impact.
iii. Scalability and performance suffer if sticky sessions must be maintained for long-polling clients.
iv. A surge in client-side and server-side load has a detrimental effect on the performance of both parties.
v. For every new connection, headers and cookies must be sent on each request.
vi. Timeouts need to be substantially high for long polling.
vii. Caching implemented at intermediaries like proxies/gateways does not go well with long polling; typically, Cache-Control is set to 'no-cache', max-age=0.

Describe the ideal solution
Prakash Lingam:
The Worker Task Pattern, also called Queue-Job, can be used to address scalability, availability, and maintainability. 'Worker Task', 'Queue Worker', and 'Queue-Job' are synonymous; we will use 'Worker Task'. While the Worker Task Pattern may not address all of the above issues, it does alleviate them, and it helps with scalability and performance by not needing long timeouts, caching, etc. The assumption for this pattern is that the Bundle Service API supports POST and implements the Worker Task Pattern asynchronously.

Describe a "Good Enough" solution
Prakash Lingam: Narrated below are a few aspects to consider while implementing this pattern.

1. Worker Resource URI: the Worker Resource URI can be sent to the OPA client invoking the Bundle Service API in two ways:
i. The Content-Location header can be populated with the Worker Resource URI on the initial POST request. The initial POST request could validate that the request comes from an OPA client and return this header with a 202.
ii. It can be returned in a link object whose value points to the Worker Resource URI.
A combination of the above two approaches could also be implemented.
The status of the worker resource should be returned with appropriate status codes and friendly messages like "Policy bundle for this document path is in progress", etc. Since the initial request returns a Worker Resource URI, it would be icing on the cake from the OPA client's perspective to receive a Retry-After header in the response to a GET on the worker resource. This header tells the OPA client when to poll, or even not to poll at all, which saves network I/O. If the worker task is done, the GET returns a 303 status with a link in the Location header. It makes sense to cache worker resources on the several service instances where OPA is deployed.
Depending on the use-case requirements, a worker resource can be scoped to the entire set of OPA policy and data, or to a subset of it, if the policy and data come from multiple sources. In the latter case, each worker resource could point to its own subset of policy and data, pertaining to the document path defined in the roots element of the manifest. The ETag header must be set to a hashed representation of the cacheable resource state in the response, along with a Cache-Control header, which describes the cacheability of the resource. If the worker resource is done/finished, the OPA policy and data can be cached in the service instance until deletion. In other cases, the Cache-Control header can ideally be set to max-age <= the Retry-After header value.

The trade-off of the Worker Task Pattern is that the OPA client may not be able to get real-time policy updates as with long polling. A client using this pattern may have to be content with near-real-time updates, but the benefits outweigh the trade-offs in terms of scalability, performance, etc.

Question to ponder: what is the rationale behind choosing long polling for policy bundle download from a centralized server and storing bundles in the in-memory store of OPA instances? IMHO, just because OPA is implemented in Go, leveraging HTTP long polling may not be a justifiable reason, and falling back to short polling may not serve extreme scaling cases well.
If the bundle server supports the Worker Task Pattern, remove polling in the bundle download configuration from the 'HTTP Long Polling' section (https://www.openpolicyagent.org/docs/latest/management-bundles/). My suggestion is to implement the Worker Task Pattern as another option (in a future release of OPA), instead of offering only long polling with short polling as a fallback (when long polling is not supported). Hence, the Worker Task Pattern should be supported in addition to short/long polling.
The Bundle Service API URL takes the form of an HTTP GET, but a GET request does not necessarily warrant an async approach. Hence, the GET can be modeled/overloaded as a POST. The GET operation from the OPA client to the remote HTTP server can be overloaded as a POST method either i) by intermediaries like an API gateway/proxy, or ii) through a configuration change in the HTTP server. Alternatively, instead of overloading GET as POST, the OPA client can invoke the Bundle Service API as a POST directly, in which case no intermediaries are needed for the overloading, but the method signature of the Bundle Service API changes from GET to POST.
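As a sketch of option i), a reverse proxy such as nginx can rewrite the method before forwarding the request upstream using its `proxy_method` directive; the location path and upstream name below are illustrative:

```nginx
# Hypothetical reverse-proxy fragment: the OPA client still issues a GET,
# but the bundle service behind the proxy receives it as a POST.
location /bundles/ {
    proxy_method POST;
    proxy_pass   http://bundle-service;
}
```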
Additional Context
Worker Task, which is called "Queue Worker" in Drupal, shows how to pull content from a centralized server and share updates with clients. The queue also supports background async processing. Excerpts from https://www.valuebound.com/resources/blog/drupal-queue-worker-api :
"Queue is more efficient and can handle resource-intensive tasks. The API also allows you to revert the item back to queue if any failure occurs. Most importantly, you can run multiple queues without interrupting other work."
One more link about QueueWorker: https://www.drupal.org/project/drupal/issues/3242216