Timeouts or invalid request IDs happen intermittently

strongdm / accessbot

Manage access to strongDM resources via Slack

Apache License 2.0

Timeouts or invalid request IDs happen intermittently #223

Closed: jeffalloy closed this issue 2 years ago

jeffalloy commented 2 years ago

Describe the bug

Sometimes a request is sent and the message is visible in the request channel. However, when someone approves or rejects the request, the request ID is nowhere to be found.

This happens well below the SDM_ADMIN_TIMEOUT as well.

To Reproduce

Steps to reproduce the behavior:

1. Ask for a resource request
2. Sometimes the request ID is invalid

Expected behavior

The request is valid and access is granted.

Screenshots

camposer commented 2 years ago

Hi @jeffalloy

Thanks for sharing this. Very interesting issue. It looks like the request is timing out before someone can approve it.

In the example you've shared above, chuan enters the wrong ID (the ID is case-sensitive), and the request then fails when lee enters it properly.

The default timeout for approving an access request is very low: 30 seconds. Could you please try increasing it? See here
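To make the two points above concrete, here is a minimal Python sketch. It assumes the setting written as "SDM ADMIN TIMEOUT" in the report is an environment variable named `SDM_ADMIN_TIMEOUT`. The `pending` table and both function names are hypothetical illustrations, not accessbot's actual code.

```python
import os

DEFAULT_ADMIN_TIMEOUT = 30  # seconds; the default mentioned above


def admin_timeout_seconds(env=os.environ):
    """Return the approval timeout, falling back to the 30-second default.

    Assumption: the value is read from an SDM_ADMIN_TIMEOUT variable.
    """
    raw = env.get("SDM_ADMIN_TIMEOUT")
    return int(raw) if raw else DEFAULT_ADMIN_TIMEOUT


# Hypothetical table of pending requests, keyed by request ID.
pending = {"AbCd": "access to db-prod"}


def find_request(request_id):
    """Exact-match lookup, so 'abcd' does not match 'AbCd'."""
    return pending.get(request_id)
```

With a lookup like this, an approver who types the ID in the wrong case gets the same "not found" result as one who approves after the timeout has elapsed.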

Thank you,

jeffalloy commented 2 years ago

Hi @camposer, we increased the timeout to 3600 seconds (1 hour) and sometimes it still fails.

camposer commented 2 years ago

Interesting @jeffalloy

The only cases I can think of:

* The request expired
* The request was already approved
* The bot was restarted (access requests are stored in memory)

Could you please confirm if any of the points above applies? If not, please send your logs and a reference to this issue to: support at strongdm.com
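A rough sketch of why those three cases look identical to the approver, assuming a simple in-memory table of pending requests. The `InMemoryRequests` class and its method names are hypothetical and only illustrate the idea; they are not taken from accessbot.

```python
import time


class InMemoryRequests:
    """Hypothetical store: pending requests live only in this process."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._pending = {}  # request_id -> creation time

    def add(self, request_id):
        self._pending[request_id] = self._clock()

    def approve(self, request_id):
        created = self._pending.pop(request_id, None)
        if created is None:
            # Unknown ID: already approved, or lost when the process restarted.
            return "invalid request id"
        if self._clock() - created > self._timeout:
            # The request outlived the approval timeout.
            return "invalid request id"
        return "approved"
```

A restart corresponds to creating a new `InMemoryRequests` object: the new instance starts with an empty table, so every ID issued before the restart is reported as invalid regardless of the timeout.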

jeffalloy commented 2 years ago

Interesting @jeffalloy

The only use case I can think of:
* The request expired

* The request was already approved

* The bot was restarted - access requests are stored in memory
Could you please confirm if any of the points above applies? If not, please send your logs and a reference to this issue to: support at strongdm.com

Hi @camposer, when it happens again I can send the logs with a reference to this issue. I'll close this out for now.

lslaslo commented 2 years ago

@camposer we do see frequent restarts of the service in Fargate. Anything we can do to make this more resilient?

camposer commented 2 years ago

@lslaslo there's a plan for persisting access requests on disk. You'd need to mount a volume with EFS or similar, but it could do the trick. I'll communicate your request to the team.
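As an illustration of the general idea only (not the implementation that was planned), persisting pending requests to a file on a mounted volume could look roughly like this. The `FileBackedRequests` class, its methods, and the JSON file format are all assumptions made for this sketch.

```python
import json
import os
import tempfile


class FileBackedRequests:
    """Hypothetical store that reloads pending requests after a restart."""

    def __init__(self, path):
        self._path = path  # e.g. a file on an EFS-backed volume
        self._pending = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self._pending = json.load(f)

    def _save(self):
        # Write to a temporary file, then rename it over the target,
        # so a crash mid-write does not leave a truncated state file.
        directory = os.path.dirname(self._path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self._pending, f)
        os.replace(tmp, self._path)

    def add(self, request_id, details):
        self._pending[request_id] = details
        self._save()

    def pop(self, request_id):
        details = self._pending.pop(request_id, None)
        if details is not None:
            self._save()
        return details
```

Because the state is reloaded in `__init__`, a replacement Fargate task pointed at the same file would still recognize request IDs issued by the task it replaced.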

lslaslo commented 2 years ago

@camposer did a persistent storage option ever happen?

camposer commented 2 years ago

Hi @lslaslo

Thanks for reaching out! Yes, it was implemented. Please take a look at the docs