Open trojanh opened 6 years ago
Hi, sorry for the delayed response. Basically, I have a two-tiered supervisor system.
The easiest one to explain is the DynamicSupervisor part. Once a slack worker is up and running properly, I add it to the DynamicSupervisor. That way, if it crashes for some transient error, the dynamic supervisor will restart it. It needs to be dynamic because I do not know at start time how many workers I need. As users go through the OAuth flow to add the app, I need to start a new worker.
DynamicSupervisor.start_child(Worker.Supervisor, {Worker, team})}
Okay, so this handles any transient errors. However, as you have seen, if the user removes the app from their slack workspace the following happens:
1) The worker process will fail with a :ssl_socket exception 2) The dynamic supervisor will notice the crash and attempt to restart the worker 3) The worker will crash with auth_invalid 4) The dynamic supervisor will notice the crash and attempt to restart the worker. 5) Go to step 2, until the max_restart policy is hit and the dynamic supervisor gives up.
To remedy this, I add my manual supervision on top of it. I leave any of the transient errors up to the DynamicSupervisor, but on initialization, I manually handle certain crashes. Namely, if SlackBot.start_link throws either a invalid_auth or account_inactive error, I return :ignore
. This means the worker will not be dynamically supervised, and therefore won't death roll.
So, that solves one problem, but I still want to do something about the app. My solution is to write a bit in my database that says the user has removed the app, and then I won't try to start it again.
Finally, With my TeamMonitor, I poll the database every minute to see which teams I need to start. Since this polling ignores slack workspaces with the bit set above, I won't keep trying to start dead accounts.
The inner tier is left to the DynamicSupervisor. During runtime, I will add bots as users use the OAuth flow to add my app to their workspace. Let's ignore for the moment how they get started
@trojanh from what I can see, the biggest problem you have is that your child slack processes are not under supervision processes = Slack.Bot.start_link(MyApp.SlackBot, [], bot.bot_access_token)
Therefore, if they crash for any reason, you won't know about it and they won't get restarted. However, once you fix that (say with a dynamicsupervisor), you'll run into the auth_invalid problem above, and then can play with the solution I have.
What would be helpful for me would be to know what your requirements are. Do you poll for new teams? What do you do when teams remove your app? That would help me think about how to genericize my code
You can check following code to start the Bot on server startup, it handles most of the errors and deletes the configurations from DB including the case when the user has removed it from DB. https://gist.github.com/trojanh/0f8491e8930333a5b447282201686de6
Also I provide a screen on UI for viewing list of connected teams and delete them directly from that screen(which will remove the Team tokens and prevent any communication from that team unless it is added again explicitly.
I think we can have chat on this over this slack team personally so that we can get quick replies. https://elixir-lang.slack.com/
There is also a Slack endpoint to revoke the tokens which i used while deleting the tokens https://api.slack.com/methods/auth.revoke
I was trying to understand your way of managing slack teams by DynamicSupervisor, correct if I am wrong but this Worker.ex is called in Application.ex as Worker.Supervisor?
I see you have made a Worker file which has the code to start the bots for Team which I suppose is used to start_up all the saved Team bots from DB. I am not able to understand from are you invoking this Module. I see Worker.start_link is called no where. If you could give me some cues for this I will be able to understand how it has been done and avoid Bot crash as you said above.
I am willing to understand how you added them to DynamicSupervisor tree, as it needs to be invoked by Slack.Bot.start_link() but I see you haven't added it anywhere in the tree.
Thanks in advance.
I have a project where I am starting all the SlackBots for teams and am doing that as below:
Bot.ex
my_app.ex
This seems to be working fine but most of the it crashes with error
I have been digging a lot since past few months with GenServer and Supervisor but nothing worked. I think you are using a similar thing in your Project. You need to start all the team Bots together on start like me. But when this websocket error is thrown it is impossible to trace the logs and cause for crash, which is fine but we have no idea that our Bot is inactive on Slack and only know when a user tells us. I want a way to restart or prevent crash in any circumstances. What approach would you recommend for this? What purpose is DynamicSupervisor serving for you ?
Also is your Slack Bot stable with your design? Does it crash randomly?
Thanks in advanace for understanding this.