open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0
9.77k stars 1.36k forks source link

OPA not closing gracefully #6676

Open bdumpp opened 8 months ago

bdumpp commented 8 months ago

We are experiencing OPA pods logging errors during shutdown, even when it should be a graceful shutdown because a new pod is being deployed in an old pod's place. The error message we consistenly get when restarting pods is

{"level":"error","msg":"Status update failed: Post \"[http://slp-name/v1/status\](http://slp/v1/status/)": context canceled.","plugin":"status","time":"2024-04-04T11:38:45+02:00"}

We have noticed this in versions 0.60.0 and 0.62.1.

ashutosh-narkar commented 8 months ago

When a graceful shutdown time is provided, it's taken into account for stopping both the server and plugins. This error is coming from the status plugin while it must be uploading a status update.

stale[bot] commented 7 months ago

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.

bdjgs commented 6 months ago

When running without setting the --shutdown-wait-period flag, the OPA will log the following when being shutdown.

{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T11:56:58Z"}
{"client_addr":"100.78.24.1:53480","level":"info","msg":"Received request.","req_id":9,"req_method":"GET","req_path":"/health","time":"2024-06-04T11:56:59Z"}
{"client_addr":"100.78.24.1:53480","level":"info","msg":"Sent response.","req_id":9,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.431618,"resp_status":200,"time":"2024-06-04T11:56:59Z"}
{"client_addr":"100.78.24.1:53496","level":"info","msg":"Received request.","req_id":10,"req_method":"GET","req_path":"/health","time":"2024-06-04T11:57:01Z"}
{"client_addr":"100.78.24.1:53496","level":"info","msg":"Sent response.","req_id":10,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.348637,"resp_status":200,"time":"2024-06-04T11:57:01Z"}
{"level":"info","msg":"Shutting down...","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Server shutdown.","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping bundle loader.","name":"systems/b9ecfbbd532a470fb1fdb3fefab1f5f7","plugin":"bundle","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping decision logger.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Flushing decision logs.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"All decisions in buffer uploaded.","plugin":"decision_logs","time":"2024-06-04T11:57:03Z"}
{"level":"info","msg":"Stopping status reporter.","plugin":"status","time":"2024-06-04T11:57:03Z"}
{"level":"error","msg":"Status update failed: Post \"http://styrasystem-slp/v1/status\": context canceled.","plugin":"status","time":"2024-06-04T11:57:03Z"}
{"level":"error","msg":"Status update failed: Post \"http://styrasystem-slp/v1/status\": context canceled.","plugin":"status","time":"2024-06-04T11:57:03Z"}

Setting the flag to e.g. 10, it will log the same as above, but with the extra Waiting 10s before initiating shutdown... line. This indicates that 10 seconds is not enough time to shutdown.

However, If I set the --shutdown-wait-period to something high like 50 seconds, the OPA will not log anything related to shutting down, but will just stop running:

{"client_addr":"100.78.24.1:59184","level":"info","msg":"Received request.","req_id":36,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:09Z"}
{"client_addr":"100.78.24.1:59184","level":"info","msg":"Sent response.","req_id":36,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.389676,"resp_status":200,"time":"2024-06-04T12:28:09Z"}
{"level":"info","msg":"Status update sent successfully in response to bundle update.","plugin":"status","time":"2024-06-04T12:28:09Z"}
{"level":"info","msg":"Waiting 50s before initiating shutdown...","time":"2024-06-04T12:28:10Z"}
{"client_addr":"100.78.24.1:59188","level":"info","msg":"Received request.","req_id":37,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:11Z"}
{"client_addr":"100.78.24.1:59188","level":"info","msg":"Sent response.","req_id":37,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.787576,"resp_status":200,"time":"2024-06-04T12:28:11Z"}
{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T12:28:12Z"}
{"client_addr":"100.78.24.1:59196","level":"info","msg":"Received request.","req_id":38,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:13Z"}
....
{"client_addr":"100.78.24.1:57560","level":"info","msg":"Received request.","req_id":48,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:33Z"}
{"client_addr":"100.78.24.1:57560","level":"info","msg":"Sent response.","req_id":48,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.378612,"resp_status":200,"time":"2024-06-04T12:28:33Z"}
{"level":"info","msg":"Status update sent successfully in response to bundle update.","plugin":"status","time":"2024-06-04T12:28:34Z"}
{"client_addr":"100.78.24.1:39568","level":"info","msg":"Received request.","req_id":49,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:35Z"}
{"client_addr":"100.78.24.1:39568","level":"info","msg":"Sent response.","req_id":49,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.413382,"resp_status":200,"time":"2024-06-04T12:28:35Z"}
{"client_addr":"100.78.24.1:39570","level":"info","msg":"Received request.","req_id":50,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:37Z"}
{"client_addr":"100.78.24.1:39570","level":"info","msg":"Sent response.","req_id":50,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":1.012793,"resp_status":200,"time":"2024-06-04T12:28:37Z"}
{"client_addr":"100.78.24.1:39580","level":"info","msg":"Received request.","req_id":51,"req_method":"GET","req_path":"/health","time":"2024-06-04T12:28:39Z"}
{"client_addr":"100.78.24.1:39580","level":"info","msg":"Sent response.","req_id":51,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.483066,"resp_status":200,"time":"2024-06-04T12:28:39Z"}
{"level":"info","msg":"Status update sent successfully in response to decision log update.","plugin":"status","time":"2024-06-04T12:28:39Z"}
ashutosh-narkar commented 6 months ago

Do you set any value for --shutdown-grace-period?

bdjgs commented 6 months ago

No, I couldn't really see any difference when setting the flag. For example, if I set it to 50 and deleted the OPA pod, it would just close down as if the flag wasn't set.

ashutosh-narkar commented 6 months ago

The plugin manager will create a new context with the timeout value set to the Graceful Shutdown Period (--shutdown-grace-period). A quick look at the code in the status plugin and it looks like it may be not be taking that timeout into account iiuc.

stale[bot] commented 5 months ago

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.