open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0

Bundle Service API Implementation using Worker Task Pattern #6154

Closed plingam-infy closed 1 year ago

plingam-infy commented 1 year ago

Prakash Lingam: OPA invokes the Bundle Service API to pull policy bundles.

Bundle download is implemented in Go as an OPA plugin. Periodic bundle download (short polling) runs in a loop that waits for a configurable timer to elapse; when it does, the bundle is downloaded in the background. Depending on client needs, this can drive an increase in polling frequency: the lower the latency a client will accept, the higher the polling frequency, and the more resources consumed on the server and network side. Long polling addresses these latency and resource issues, but not necessarily at scale.
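For context, a simplified sketch of such a timer-driven short-polling loop in Go (this is an illustration, not OPA's actual plugin code; the bundle URL and interval are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical bundle endpoint and polling interval; OPA's real plugin
	// reads these from its bundle configuration.
	const bundleURL = "https://bundle-server.example.com/bundles/app.tar.gz"
	const interval = 30 * time.Second

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		// Each tick triggers a fresh request; resources are spent on
		// connection setup and on responses even when nothing changed.
		resp, err := http.Get(bundleURL)
		if err != nil {
			fmt.Println("download failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("polled bundle endpoint, status:", resp.Status)
	}
}
```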

Long polling makes an HTTP request to the server and keeps the TCP/IP connection open until the server responds. The server responds with an update or a timeout, as applicable.
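A minimal long-polling request in Go might look like the following sketch (the endpoint is a placeholder; OPA's bundle docs describe signaling the acceptable wait time via a Prefer header, which is assumed here):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// longPoll issues a single long-poll request: the server is expected to hold
// the connection open until an update is available or its timeout elapses.
func longPoll(url string, wait time.Duration) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	// Advertise the acceptable wait time; the server may respond earlier
	// with an update or at the deadline with a timeout response.
	req.Header.Set("Prefer", fmt.Sprintf("wait=%d", int(wait.Seconds())))

	// The client timeout must exceed the server-side wait, which is one of
	// the operational costs noted above.
	client := &http.Client{Timeout: wait + 10*time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println("long poll returned status:", resp.Status)
	return nil
}

func main() {
	// Hypothetical endpoint; the real URL comes from OPA's bundle config.
	_ = longPoll("https://bundle-server.example.com/bundles/app.tar.gz", 60*time.Second)
}
```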

I have not worked on long polling with OPA, but my experience with long polling elsewhere is first-hand. Its performance and scalability limitations are well known: reverse proxies, gateways, and WAFs do not scale easily when many long-polling connections are held open.

Question to ponder: if we have to deploy OPA instances at scale behind infrastructure such as reverse proxies, gateways, and WAFs, can OPA scale with long polling (especially when there are many bundles to download)?

Go has libraries that support long polling, and OPA (being implemented in Go) leverages them for policy bundle download. The OPA documentation references https://datatracker.ietf.org/doc/html/rfc6202#section-2.2, which clearly states the long-polling issues above. One more link, though it is for another platform: https://ably.com/topic/long-polling

What is the underlying problem you're trying to solve?

Prakash Lingam: The HTTP long-polling issues are summarized below:

i. During connection setup, the request passes through intermediaries such as gateways and proxies that provide load balancing, caching, monitoring, SSL handling, etc. Keeping connections open across all of these intermediaries becomes a scalability problem.

ii. Although proxies support long polling, not all of them can buffer the response on behalf of the server, and maintaining open connections incurs resource overhead (CPU, memory, network bandwidth) across every intermediary. These factors lead to a performance impact.

iii. Scalability and performance suffer further if sticky sessions must be maintained for long-polling clients.

iv. A surge in client-side or server-side load has a detrimental effect on the performance of both parties.

v. For every new connection, headers and cookies must be sent on each request.

vi. Timeouts need to be substantially and relatively high for long polling.

vii. Caching implemented at intermediaries such as proxies/gateways does not play well with long polling; typically cache-control is set to 'no-cache', max-age=0.

Describe the ideal solution

Prakash Lingam:
The Worker Task Pattern, also known as Queue-Job, can be used to address scalability, availability, and maintainability. 'Worker Task', 'Queue Worker', and 'Queue-Job' are synonymous; we will use 'Worker Task'.

While the Worker Task Pattern may not address all of the above issues, it does alleviate them and helps with scalability and performance by not needing long timeouts, by enabling caching, etc. The assumption for this pattern is that the Bundle Service API supports POST and implements the Worker Task Pattern asynchronously.
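To make the proposed flow concrete, here is a minimal client-side sketch under those assumptions. The submit URL, the use of Content-Location, Retry-After, and the 303/Location handshake are all part of this proposal, not an existing OPA API:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// fetchBundleAsync sketches the client side of the proposed Worker Task
// Pattern: submit a request, then poll the returned worker resource until
// the bundle is ready.
func fetchBundleAsync(submitURL string) (string, error) {
	// 1. Initial POST; the server validates the request and answers
	//    202 Accepted with the worker resource URI in Content-Location.
	resp, err := http.Post(submitURL, "application/json", nil)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	// Assume the server returns an absolute URI; a relative one would need
	// to be resolved against submitURL.
	workerURI := resp.Header.Get("Content-Location")

	// 2. Poll the worker resource, honouring Retry-After to avoid
	//    unnecessary network I/O.
	client := &http.Client{
		// Do not auto-follow the final 303 so we can read Location ourselves.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	for {
		st, err := client.Get(workerURI)
		if err != nil {
			return "", err
		}
		st.Body.Close()
		if st.StatusCode == http.StatusSeeOther {
			// 3. Done: the Location header points at the finished bundle.
			return st.Header.Get("Location"), nil
		}
		wait := 5 * time.Second // default if no Retry-After is present
		if s := st.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(wait)
	}
}

func main() {
	loc, err := fetchBundleAsync("https://bundle-server.example.com/bundles/app")
	fmt.Println(loc, err)
}
```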

Describe a "Good Enough" solution

Prakash Lingam: Below are a few aspects to consider while implementing this pattern; a combined server-side sketch follows the list.

1. Worker Resource URI:

The Worker Resource URI can be returned to the OPA client invoking the Bundle Service API in two ways.

i. The Content-Location header could be populated with the Worker Resource URI on the initial POST request. The initial POST could validate the request coming from the OPA client and return this header with a 202.

ii. It could be returned in a link object whose value points to the Worker Resource URI.

A combination of the above two approaches could also be implemented.
The status of the worker resource should be returned with appropriate status codes and friendly messages such as "Policy bundle for this document path is in progress".

2. Retry-After Header:

Since the initial request returns a Worker Resource URI, it would be valuable from the OPA client's perspective to receive a Retry-After header in the response to a GET on the worker resource. This header tells the OPA client when to poll, or even not to poll at all, and saves network I/O.

If the worker task is done, the GET returns a 303 with a link to the result in the Location header.

3. Caching of Worker Resource:

It makes sense to cache worker resources on the various service instances where OPA is deployed.

Depending on the use case, a worker resource can be scoped to the entire OPA policy and data set, or to a subset of it when policy and data come from multiple sources. In the latter case, each worker resource would point to its own subset of policy and data, i.e. the document path defined in the roots element of the manifest.

The ETag header must be set to a hashed representation of the cacheable resource state in the response, along with a Cache-Control header that describes the cacheability of the resource. If the worker resource is done/finished, the OPA policy and data can be cached in the service instance until deletion. In other cases, the Cache-Control header can ideally be set to max-age <= the Retry-After header value.
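Putting the three aspects above together, here is a hedged server-side sketch of how a bundle server might implement them. The /bundles/ and /workers/ paths, the job IDs, and the simulated worker are all illustrative; this is not an existing OPA or bundle-server API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
	"sync"
	"time"
)

// task holds the server-side state of one worker resource.
type task struct {
	done      bool
	resultURL string // where the finished bundle can be fetched
}

var (
	mu    sync.Mutex
	tasks = map[string]*task{}
)

// handleSubmit: the initial POST. It validates the request and returns
// 202 Accepted with the worker resource URI in Content-Location (aspect 1).
func handleSubmit(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "POST required", http.StatusMethodNotAllowed)
		return
	}
	mu.Lock()
	id := fmt.Sprintf("job-%d", len(tasks)+1)
	tasks[id] = &task{}
	mu.Unlock()

	// Simulate a queued worker finishing the build later on.
	go func() {
		time.Sleep(10 * time.Second)
		mu.Lock()
		tasks[id].done = true
		tasks[id].resultURL = "/results/" + id + ".tar.gz"
		mu.Unlock()
	}()

	w.Header().Set("Content-Location", "/workers/"+id)
	w.WriteHeader(http.StatusAccepted)
	fmt.Fprintln(w, "Policy bundle for this document path is in progress")
}

// handleWorker: GET on the worker resource. While in progress it returns
// Retry-After (aspect 2); when finished it returns 303 with Location,
// with ETag/Cache-Control derived from the resource state (aspect 3).
func handleWorker(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/workers/")
	mu.Lock()
	t, ok := tasks[id]
	done, resultURL := false, ""
	if ok {
		done, resultURL = t.done, t.resultURL
	}
	mu.Unlock()
	if !ok {
		http.NotFound(w, r)
		return
	}

	// ETag: a hashed representation of the cacheable resource state.
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%v", id, done)))
	w.Header().Set("ETag", `"`+hex.EncodeToString(sum[:8])+`"`)

	if !done {
		const retryAfter = 30 // seconds; keep Cache-Control max-age aligned
		w.Header().Set("Retry-After", fmt.Sprint(retryAfter))
		w.Header().Set("Cache-Control", fmt.Sprintf("max-age=%d", retryAfter))
		fmt.Fprintln(w, "Policy bundle for this document path is in progress")
		return
	}
	w.Header().Set("Cache-Control", "max-age=3600")
	http.Redirect(w, r, resultURL, http.StatusSeeOther)
}

func main() {
	http.HandleFunc("/bundles/", handleSubmit)
	http.HandleFunc("/workers/", handleWorker)
	http.ListenAndServe(":8282", nil)
}
```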

The trade-off of the Worker Task Pattern is that the OPA client may not get real-time policy updates as it would with long polling. A client using this pattern may have to be content with near-real-time updates, but the benefits outweigh the trade-offs in terms of scalability, performance, etc.

Question to ponder: what is the rationale for choosing long polling to download policy bundles from a centralized server and store them in the OPA instances' in-memory store? IMHO, the fact that OPA is implemented in Go and Go supports HTTP long polling is not, by itself, a justifiable reason. Falling back to short polling may not serve extreme scaling cases well.

If the Bundle Server supports the Worker Task Pattern, polling could be removed from the bundle download configuration described in the 'HTTP Long Polling' section (https://www.openpolicyagent.org/docs/latest/management-bundles/).

My suggestion is to implement the Worker Task Pattern as another option (in a future release of OPA), instead of offering only long polling with short polling as a fallback (when long polling is not supported). In other words, the Worker Task Pattern should be supported in addition to short/long polling.

The Bundle Service API is currently invoked as an HTTP GET, and a GET request does not naturally fit an async approach. Hence, the GET can be modeled/overloaded as a POST. The GET from the OPA client to the remote HTTP server can be overloaded as a POST either i) by intermediaries such as an API gateway/proxy, or ii) through a configuration change in the HTTP server.

Alternatively, instead of overloading GET as POST, the OPA Client can invoke Bundle Service API as POST directly, in which case no intermediaries are needed, but the method signature of the Bundle Service API changes from GET to POST.
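As an illustration of option i) above, a small Go reverse proxy could rewrite the OPA client's GET into a POST toward a hypothetical async bundle service; real deployments would typically do this in the API gateway/proxy configuration instead:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream bundle service that implements the async
	// Worker Task endpoints; not an existing OPA component.
	upstream, err := url.Parse("https://bundle-server.example.com")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)
	defaultDirector := proxy.Director
	proxy.Director = func(req *http.Request) {
		defaultDirector(req)
		// Overload the OPA client's GET for a bundle as a POST to the async
		// submit endpoint, leaving the OPA side unchanged.
		if req.Method == http.MethodGet {
			req.Method = http.MethodPost
		}
	}

	// OPA keeps pointing its bundle configuration at this proxy.
	log.Fatal(http.ListenAndServe(":9090", proxy))
}
```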


Additional Context

The Worker Task, which is called a "Queue Worker" in Drupal, shows how to pull content from a centralized server and share updates with clients. The queue also supports background async processing. Excerpts from https://www.valuebound.com/resources/blog/drupal-queue-worker-api:

"Queue is more efficient and can handle resource-intensive tasks. The API also allows you to revert the item back to queue if any failure occurs. Most importantly, you can run multiple queues without interrupting other work."

One more link about QueueWorker https://www.drupal.org/project/drupal/issues/3242216

ashutosh-narkar commented 1 year ago

@plingam-infy thanks for the detailed writeup. I had a few questions:

While the Worker Task Pattern may not address all of the above issues, it does alleviate them and helps with scalability and performance by not needing long timeouts, by enabling caching, etc.

I want to understand what scalability and performance requirements we are talking about here. Have you already tried out the existing short/long polling solutions and found scalability and performance issues for your use case? Some concrete numbers would be helpful to allow us to make improvements in the existing mechanisms.

The trade-off of the Worker Task Pattern is that the OPA client may not get real-time policy updates as it would with long polling.

One of the advantages of long polling in combination with delta bundles is the fast propagation of data, so it looks like that may not be feasible in this approach.

OPA Client can invoke Bundle Service API as POST

This would need changes in the bundle service as well. We've tried to keep the bundle service API very simple so it's easy to set it up and start serving bundles to OPA.

The main point you raise is the scalability and performance limitations of the current approach, and it's possible that the Worker Task Pattern may help with that. The changes would be pretty significant, so we really need to understand the shortcomings of the current approach in terms of real numbers and also clearly see how the Worker Task Pattern actually improves on them.

One thing I can imagine doing to test this out is to create a component that implements the Worker Task Pattern client and server to consume and serve bundles respectively, and then have the client use OPA's REST API to push data/policy into OPA. You could even do this today and not use the long/short polling.
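For reference, pushing policy and data into OPA from such a component can be done with OPA's documented REST API (PUT /v1/policies/<id> and PUT /v1/data/<path>); a minimal sketch with placeholder IDs, paths, and payloads:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// pushPolicy uploads a Rego module to a local OPA via PUT /v1/policies/<id>.
func pushPolicy(opaURL, id, rego string) error {
	req, err := http.NewRequest(http.MethodPut, opaURL+"/v1/policies/"+id,
		bytes.NewBufferString(rego))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "text/plain")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("policy upload failed: %s", resp.Status)
	}
	return nil
}

// pushData writes a JSON document under a data path via PUT /v1/data/<path>.
func pushData(opaURL, path, jsonDoc string) error {
	req, err := http.NewRequest(http.MethodPut, opaURL+"/v1/data/"+path,
		bytes.NewBufferString(jsonDoc))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("data upload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	const opa = "http://localhost:8181"
	_ = pushPolicy(opa, "example", "package example\n\nallow := true\n")
	_ = pushData(opa, "example/config", `{"env": "dev"}`)
}
```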

plingam-infy commented 1 year ago

Hi @ashutosh-narkar,

I have not implemented polling in OPA. In a typical large-scale enterprise system, we have to take into account all intermediaries such as WAFs, CDNs, reverse proxies, and API proxies. Long-polling connections can result in timeouts (even if the timeout is extended to its maximum, which the intermediaries don't prefer). Short polling wastes resources in connection setup. A timeout also results in a 200 with long polling, AFAIK. Further, you cannot set a 'cache-control' header with a suitable value for long polling.

Delta bundles are only for 'data' JSON, not for 'policy' bundles (AFAIK). As I said, the initial request is a POST; only the subsequent worker resource requests are GETs.

An in-memory queue can be attached to the bundle server; a worker picks up items and runs them as a separate process on the server (enough context must be present to process the request, because from the client's perspective it is fire-and-forget, though the worker is aware of the parent POST request). Even if policy comes from multiple sources, keep an in-memory queue or move the queue to a separate component outside the bundle server. Aggregation of all worker responses (GET) should be done on the application side.
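A minimal sketch of such an in-memory queue, assuming a buffered Go channel as the queue and a small pool of goroutine workers (all names and the simulated work are illustrative):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// buildJob carries enough context from the parent POST request for the
// worker to process it independently (fire-and-forget from the client side).
type buildJob struct {
	ID       string
	RootPath string // document path from the manifest roots element
}

func main() {
	// Buffered channel acting as the in-memory queue attached to the
	// bundle server; it could later move to a separate component.
	queue := make(chan buildJob, 128)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // a small pool of workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				// Placeholder for building/fetching the bundle subset for
				// this document path and storing it for the worker resource.
				time.Sleep(100 * time.Millisecond)
				fmt.Printf("built bundle subset %s for root %s\n", job.ID, job.RootPath)
			}
		}()
	}

	// The POST handler would enqueue jobs like this and return 202.
	queue <- buildJob{ID: "job-1", RootPath: "policies/app"}
	queue <- buildJob{ID: "job-2", RootPath: "policies/db"}

	close(queue)
	wg.Wait()
}
```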

If this issue arises with polling, you can consider my feature request (Worker Task Pattern) or reach out to me in the future. Thanks.

ashutosh-narkar commented 1 year ago

Thanks for the input @plingam-infy. It's good to keep this option in mind if we see significant scalability and performance limitations with the current mechanisms. Thanks for bringing this up!