permitio / opal

Policy and data administration, distribution, and real-time updates on top of Policy Agents (OPA, Cedar, ...)
https://opal.ac
Apache License 2.0

OPAL Client Does Not Retry Fetching Data in External Data Sources Mode #623

Closed: h9ing closed this issue 1 week ago

h9ing commented 2 months ago

Hello,

I am using the OPAL client in a Kubernetes environment alongside an OPA sidecar. My OPAL client is configured to use external data sources mode. I have noticed that when a data fetch exceeds the 10-second timeout, the data is not fetched again until the next trigger event occurs; there seems to be no retry.

I have reviewed the source code, and it seems there should be a retry mechanism by default, but it is not working as expected. The default setting is `{"wait_strategy": "random_exponential", "max_wait": 10, "attempts": 5, "wait_time": 1}`.
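If I'm reading it right, those fields map onto tenacity-style retry parameters, roughly like this (my own illustrative sketch, assuming tenacity-like semantics, not the actual OPAL code):

```python
# Illustration only: how I read the default settings above, expressed with the
# tenacity library. The real OPAL wiring may differ.
import aiohttp
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(
    stop=stop_after_attempt(5),                          # "attempts": 5
    wait=wait_random_exponential(multiplier=1, max=10),  # "wait_time": 1, "max_wait": 10
    reraise=True,
)
async def fetch_data_source(url: str) -> dict:
    # One attempt to fetch the data source; a raised exception triggers a retry.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()
```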

I have also tried setting the DATA_UPDATER_CONN_RETRY in the Kubernetes container's environment variables as well as in /usr/.env, but it does not take effect. Below are my settings:

K8S env:
DATA_UPDATER_CONN_RETRY={"wait_strategy":"random_exponential","max_wait":5,"attempts":10,"wait_time":3}
OPAL_SERVER_URL=http://opal-server.opal.svc.cluster.local
OPAL_POLICY_STORE_URL=http://localhost:8181

/usr/.env:
DATA_UPDATER_CONN_RETRY={"wait_strategy":"random_exponential","max_wait":5,"attempts":10,"wait_time":3}
OPAL_DATA_TOPICS=test-topic
OPAL_CLIENT_TOKEN=eyxxxxxxxx

These are my OPAL client logs: https://gist.github.com/h9ing/caee9660adbb3c14624bdb828d697090

How can I configure the OPAL client to retry fetching data?

Thank you!

OPAL version:
OPAL Client: opal-client-standalone:0.7.7
OPAL Server: opal-server:0.7.7

obsd commented 2 months ago

Hi @h9ing, thanks for the detailed report. I will ask someone from the team to take a look. Meanwhile, I would start by raising the initial timeout: you can do that by setting the env var OPAL_FETCHING_CALLBACK_TIMEOUT to a higher number.

roekatz commented 2 months ago

Hi @h9ing,

  1. Make sure you prefix all configuration env vars with OPAL_ (it should be OPAL_DATA_UPDATER_CONN_RETRY).
  2. As @obsd said, you probably want to increase OPAL_FETCHING_CALLBACK_TIMEOUT, since that is the overall timeout the OPAL client gives the fetcher. Within that timeout, DATA_UPDATER_CONN_RETRY configures the retry mechanism (see the sketch below).
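To illustrate how the two settings nest, here is a rough sketch of the idea (simplified, with a fixed wait between attempts; not the actual client code):

```python
import asyncio

FETCHING_CALLBACK_TIMEOUT = 20  # stands in for OPAL_FETCHING_CALLBACK_TIMEOUT (seconds)
RETRY_ATTEMPTS = 5              # "attempts" from DATA_UPDATER_CONN_RETRY
RETRY_WAIT = 1.0                # "wait_time" (fixed here, for brevity)

async def fetch_once() -> dict:
    # Placeholder for a single HTTP request to the data source.
    await asyncio.sleep(0.1)
    return {"status": "ok"}

async def fetch_with_retries() -> dict:
    # Inner loop: this is the part DATA_UPDATER_CONN_RETRY configures.
    for attempt in range(RETRY_ATTEMPTS):
        try:
            return await fetch_once()
        except Exception:
            if attempt == RETRY_ATTEMPTS - 1:
                raise
            await asyncio.sleep(RETRY_WAIT)

async def update_policy_data() -> dict:
    # Outer bound: the callback timeout covers the whole fetch, retries included.
    return await asyncio.wait_for(fetch_with_retries(), timeout=FETCHING_CALLBACK_TIMEOUT)

print(asyncio.run(update_policy_data()))
```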

Closing for now, but let us know if that helps or if you're still having an issue.

h9ing commented 2 months ago

Hi @roekatz

Actually, I had tried using OPAL_DATA_UPDATER_CONN_RETRY before. Now, I am using both OPAL_FETCHING_CALLBACK_TIMEOUT and OPAL_DATA_UPDATER_CONN_RETRY.

Here are the environment variables I have set:
OPAL_FETCHING_CALLBACK_TIMEOUT=20
OPAL_DATA_UPDATER_CONN_RETRY={"wait_strategy":"random_exponential","max_wait":5,"attempts":10,"wait_time":3}
OPAL_SERVER_URL=http://opal-server.opal.svc.cluster.local
OPAL_POLICY_STORE_URL=http://localhost:8181

While the timeout duration has increased, the retry mechanism still does not function correctly after the timeout. The error message remains the same, and it stops after failure until the next trigger event.

The official documentation does not provide examples for DATA_UPDATER_CONN_RETRY. I deduced the value `{"wait_strategy":"random_exponential","max_wait":5,"attempts":10,"wait_time":3}` from the source code, but I am not sure it is correct.


Could you provide further guidance on this?

Thank you!

roekatz commented 1 month ago

@h9ing How would you expect the retries to behave? and how do they actually behave?

h9ing commented 1 month ago

@roekatz

I expect that after failing to fetch policy data, the system should retry fetching the policy data.

For example, after failing to get the data-sources configuration, it will retry until it successfully obtains the data-sources configuration.

Here are the detailed logs: https://gist.github.com/h9ing/b70a7325a1f1053a52965c51032b3083

roekatz commented 1 month ago

@h9ing What value did you give to OPAL_FETCHING_CALLBACK_TIMEOUT? Just making sure we're on the same page - this value is the outer scope timeout for waiting on the entire fetch attempt - including retries.

Hmm, could it be that the data sources request doesn't fail, but rather just stays stuck until the timeout? If you can also share DEBUG-level logs, that would be helpful.

h9ing commented 1 month ago

@roekatz I apologize for any misunderstanding. This is a test scenario where I intentionally increased the policy data server response time to test the behavior after a fetch data timeout. Although the probability of a fetch data timeout occurring is not very high, we have experienced it multiple times.

I believe increasing the timeout value can solve most of the issues, but having a retry mechanism after a timeout would be even better!

My OPAL_FETCHING_CALLBACK_TIMEOUT is set to 20. Below are my debug level logs: https://gist.github.com/h9ing/41a6a19b5112aced72f33fdc97db520a

Additionally, I would like to confirm: when using external data sources mode, is fetching the policy data supposed to retry automatically after a failure by design? Or have I misunderstood the meaning of OPAL_DATA_UPDATER_CONN_RETRY?

Thank you!

roekatz commented 1 month ago

I understand now.

  1. I believe aiohttp's default timeout is 5 minutes (for the HTTP request itself); only if that fails (and an exception is raised) does the retry mechanism kick in. OPAL_FETCHING_CALLBACK_TIMEOUT is the total upper limit the client gives the fetcher as a whole (including retries), so to see a retry, OPAL_FETCHING_CALLBACK_TIMEOUT would have to be more than 300 seconds. But is there a reason you would prefer the client to give up the fetch with a smaller timeout and retry again?

  2. We've recently introduced an option to use httpx as the fetching client; there, I think the default timeout is 5 seconds, so maybe that would help you achieve what you want (enable it with OPAL_HTTP_FETCHER_PROVIDER_CLIENT=httpx). See the sketch after this list for the two defaults side by side.

  3. Regarding external data sources mode, I assume you mean the use of external_source_url, right? The behavior of the client is basically identical in that case; the only difference is that the server redirects (HTTP 307) the client to the external URL when the data source config entries are requested.
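To make the two per-request defaults concrete, here's a small sketch with the timeouts written out explicitly (values as I recall the library defaults; double-check them against the versions you have installed):

```python
import asyncio
import aiohttp
import httpx

async def main() -> None:
    # aiohttp: the default is ClientTimeout(total=300), i.e. ~5 minutes per request.
    async with aiohttp.ClientSession(
        timeout=aiohttp.ClientTimeout(total=300)  # written out; same as the default
    ) as session:
        pass  # session.get(...) would only raise a timeout error after ~300s

    # httpx: the default timeout is 5 seconds per request.
    async with httpx.AsyncClient(timeout=httpx.Timeout(5.0)) as client:
        pass  # client.get(...) would raise httpx.TimeoutException after ~5s

    # Only when the request raises (timeout or other error) does the retry
    # mechanism configured by OPAL_DATA_UPDATER_CONN_RETRY get a chance to run.

asyncio.run(main())
```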

h9ing commented 1 month ago

HI @roekatz

Thank you for your response.

Yes, I am using external_source_url! I mainly want to ensure that there is a retry mechanism after a timeout. I have previously encountered several instances where, after fetching data timed out, there was no retry, leading to an empty OPA policy store.

I also conducted a few tests:

  1. As you mentioned, the data fetch automatically retries after a failure (other than a timeout).
  2. With both aiohttp and httpx, I do see retries after a timeout. However, based on my validation, even when the policy data server eventually returns data, it does not appear to be stored in the OPA policy store.

Tests:

roekatz commented 1 month ago

Hi @h9ing,

No matter the value of OPAL_FETCHING_CALLBACK_TIMEOUT, once it times out, the data updater gives up on that fetch for good. It's probably confusing that the fetcher still prints logs after the timeout; that's because the data updater doesn't cancel the async HTTP fetcher task. But no one is waiting for it to return anymore, and the data won't be processed (see the sketch below). That HTTP fetcher task is the one doing retries on error, and as I said in the previous comment, a timeout error would be raised after 5 minutes with aiohttp or after 5 seconds with httpx.
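Roughly, it's something like this simplified sketch (illustrative only, not the actual updater code): the waiter gives up after the timeout, but the un-cancelled task keeps running and logging in the background.

```python
import asyncio

async def slow_fetch() -> dict:
    # Stands in for the async HTTP fetcher task.
    print("fetcher: starting")
    await asyncio.sleep(15)  # the data source is slow
    print("fetcher: finished, but nobody is waiting for the result anymore")
    return {"data": "too late"}

async def data_updater() -> None:
    task = asyncio.create_task(slow_fetch())
    # asyncio.wait() with a timeout does NOT cancel the pending task;
    # it just stops waiting for it, so its result is never processed.
    done, pending = await asyncio.wait({task}, timeout=10)
    if task in pending:
        print("updater: fetch timed out, giving up on this data update")
    # The event loop (and the orphaned fetcher task) keeps running, which is
    # why fetcher log lines still show up after the timeout.
    await asyncio.sleep(10)

asyncio.run(data_updater())
```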

Looking at the logs you've sent, I actually don't see any issue (in the third case, the fetcher succeeds only after the 10-second OPAL_FETCHING_CALLBACK_TIMEOUT is reached). I'm also still unsure why you would want the HTTP fetch to time out and retry when the server is slow, rather than just keep waiting for the response to complete.

Anyway, try both using httpx and setting OPAL_FETCHING_CALLBACK_TIMEOUT to a long duration. I believe that will give you what you're looking for.

roekatz commented 1 week ago

@h9ing Closing the issue. LMK if you still think something doesn't work correctly.