optimizely / agent

Agent service for Optimizely Feature Experimentation and Optimizely Full Stack (legacy)
Apache License 2.0

[BUG] Agent is not able to recover from 403 errors during SyncConfig #404

Closed brunoguic closed 10 months ago

brunoguic commented 11 months ago

Is there an existing issue for this?

Agent Version

2.7.0

Current Behavior

After an instance of PollingProjectConfigManager receives a 403 error, the Agent can neither detect nor recover from this bad state.

Expected Behavior

The Agent should detect the error and then either attempt a new authentication or terminate itself.

Steps To Reproduce

  1. Begin by initializing the Agent with the correct SDK configurations.
  2. Allow a few minutes for PollingProjectConfigManager.SyncConfig() to run, and confirm that it successfully retrieves valid data.
  3. On the Optimizely servers, intentionally invalidate the token associated with this specific Agent. (This scenario may be hard to replicate, but it has occurred in a production environment.) The Optimizely server then returns HTTP 403 (Forbidden) for every subsequent call to PollingProjectConfigManager.SyncConfig().
  4. From this point on, the Agent cannot retrieve new data, and recovering from this situation is difficult or impossible.
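
Since invalidating a real token is hard to arrange, the steps above can be approximated locally. The following is a minimal sketch that uses only the Go standard library and does not touch the Optimizely SDK: a test server serves a placeholder datafile until it is told to revoke access, after which every request gets a 403. The names `newRevocableServer` and `fetchDatafile`, and the datafile body, are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fetchDatafile (hypothetical helper) performs one poll against url.
// It returns the body on HTTP 200 and an error for any other status.
func fetchDatafile(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("datafile request failed: status %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// newRevocableServer (hypothetical helper) starts a local server that serves
// a placeholder datafile until revoke is called, then answers 403 Forbidden.
func newRevocableServer() (srv *httptest.Server, revoke func()) {
	var revoked int32
	srv = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&revoked) == 1 {
			w.WriteHeader(http.StatusForbidden)
			return
		}
		fmt.Fprint(w, `{"revision":"1"}`)
	}))
	return srv, func() { atomic.StoreInt32(&revoked, 1) }
}

func main() {
	srv, revoke := newRevocableServer()
	defer srv.Close()

	// Step 2: polling works while the token is valid.
	body, err := fetchDatafile(srv.Client(), srv.URL)
	fmt.Printf("before revoke: body=%s err=%v\n", body, err)

	// Step 3: the token is invalidated; every later poll fails with 403.
	revoke()
	body, err = fetchDatafile(srv.Client(), srv.URL)
	fmt.Printf("after revoke: body=%s err=%v\n", body, err)
}
```

Pointing an Agent's datafile URL at a server like this would be one way to exercise step 3 without touching production credentials, assuming the Agent configuration allows overriding that URL.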

Go Version

1.18

Link

No response

Logs

No response

Severity

Affecting users

Workaround/Solution

After we identify the error, we can restart the Agent to force a new authentication.
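
The manual restart could in principle be automated. The sketch below is only an illustration of that idea, not something the Agent provides: a hypothetical `failureTracker` counts consecutive failed syncs and reports when a limit is reached, at which point a process could exit and let its supervisor (for example Kubernetes) restart it. The type, the limit, and the `errForbidden` value are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden stands in for the error produced by a 403 response (assumed).
var errForbidden = errors.New("403 Forbidden")

// failureTracker (hypothetical) counts consecutive failed syncs.
type failureTracker struct {
	limit       int // number of consecutive failures that triggers giving up
	consecutive int // failures seen since the last successful sync
}

// record notes the result of one sync and reports whether the number of
// consecutive failures has reached the limit. A success resets the counter.
func (f *failureTracker) record(err error) (giveUp bool) {
	if err == nil {
		f.consecutive = 0
		return false
	}
	f.consecutive++
	return f.consecutive >= f.limit
}

func main() {
	tracker := &failureTracker{limit: 3}
	results := []error{nil, errForbidden, errForbidden, errForbidden}
	for i, err := range results {
		if tracker.record(err) {
			fmt.Printf("sync %d: giving up after %d consecutive failures\n", i, tracker.consecutive)
			// A real process might call os.Exit(1) here so that its
			// supervisor restarts it and forces a new authentication.
			return
		}
	}
}
```

Counting consecutive failures rather than reacting to the first one avoids restarting on a single transient error; the right limit would depend on the polling interval.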

Recent Change

No response

Conflicts

No response

brunoguic commented 11 months ago

I'm unsure if the error should be reported to the SDK team. Upon inspecting the SDK code, the following observations can be made:

  1. Within the PollingProjectConfigManager.SyncConfig function, the error is handled correctly, and it populates the err variable with the 403 error code.
  2. However, PollingProjectConfigManager.GetConfig does not pass this err value on to the Agent. After a series of successful requests the projectConfig is no longer nil, so GetConfig returns the cached config without an error.
  3. Consequently, the Agent remains unaware of whether the SyncConfig process is functioning correctly.
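
The three observations above can be modelled in a few lines. This is a simplified, hypothetical model written for this discussion, not the SDK's actual code: `cachingManager`, `syncConfig`, `getConfig`, and `lastSyncError` are invented names. It shows how a cached config hides a failing sync, and how an additional accessor for the last sync error would let a caller notice the problem.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errForbidden = errors.New("403 Forbidden")          // assumed sync failure
	errNoConfig  = errors.New("no project config yet") // nothing loaded so far
)

// config is a stand-in for a parsed project config (hypothetical).
type config struct{ revision string }

// cachingManager is a simplified model of a polling config manager.
type cachingManager struct {
	mu      sync.RWMutex
	current *config
	lastErr error
}

// syncConfig records the outcome of one poll. On failure it stores the
// error but keeps serving the previously loaded config.
func (m *cachingManager) syncConfig(c *config, err error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastErr = err
	if err == nil {
		m.current = c
	}
}

// getConfig models the reported behaviour: an error is returned only
// while no config has ever been loaded.
func (m *cachingManager) getConfig() (*config, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	if m.current == nil {
		if m.lastErr != nil {
			return nil, m.lastErr
		}
		return nil, errNoConfig
	}
	return m.current, nil
}

// lastSyncError is a hypothetical extra accessor that exposes the
// outcome of the most recent poll regardless of the cache.
func (m *cachingManager) lastSyncError() error {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.lastErr
}

func main() {
	m := &cachingManager{}
	m.syncConfig(&config{revision: "1"}, nil) // successful poll
	m.syncConfig(nil, errForbidden)           // token revoked

	c, err := m.getConfig()
	fmt.Printf("getConfig: revision=%s err=%v\n", c.revision, err)
	fmt.Printf("lastSyncError: %v\n", m.lastSyncError())
}
```

In this model `getConfig` keeps returning revision 1 with a nil error after the failed poll, which matches the third observation: the caller has no signal that syncing has stopped working unless something like `lastSyncError` is exposed.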

pulak-opti commented 11 months ago

Hi @brunoguic Thanks for creating the issue here. Currently the same team maintains the Agent and SDK. We'll look into this and get back to you.

pulak-opti commented 10 months ago

Hi @brunoguic We have decided to improve the error handling by converting the previous warning logs to error logs. For running the Agent in Kubernetes, we also recommend setting up a monitoring and alerting system to capture such error logs as soon as possible.