skoczen / will

Will is a simple, beautiful-to-code bot for slack, hipchat, and a whole lot more.
https://heywill.io
MIT License

High CPU usage and Will takes 10+ seconds to respond on Slack #384

Open Ashex opened 6 years ago

Ashex commented 6 years ago

I've actually been dealing with this issue for a few months and haven't gotten to the bottom of it. Will continuously consumes 100% CPU on a single core and has continued to do so despite my allocating more CPU to it (upgrading the AWS instance type from m4.medium to c4.4xlarge).

I've also been dealing with a possibly related issue where Will takes 10 seconds or more to respond to a command, and it only seems to get worse until Will is restarted.

I would love to debug and investigate this further but it's not clear how to do so.

Edit: I tweaked the logging around the execution of _watch_rtm_thread to get the process id, and that is not the thread causing the high CPU usage.

Edit 2: I added the same logging to everything spawned via Process and have observed that the analysis thread (in bootstrap_analysis) consumes 100% CPU when a command is issued.
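For anyone trying to do the same, a generic trick (not anything in will itself) is to register a signal handler that dumps every thread's stack, so you can run kill -USR1 against whichever process is hot and see what it's doing:

```python
# Generic debugging aid (not part of will): dump every thread's stack to stderr
# when the process receives SIGUSR1, so the thread spinning at 100% CPU can be
# identified with `kill -USR1 <pid>` while the spike is happening.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```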

Ashex commented 6 years ago

It's not immediately clear to me how the context of a command is used by default, so I set the analysis engine to Nothing and will now responds to every single command instantly. I'll continue observing how will behaves and report back.
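For reference, the change was in config.py; the module paths below are what I believe will 2.x's generated config uses, so double-check against your own file:

```python
# config.py -- backend module paths as I recall them from will 2.x's generated
# config; verify the exact names against your own config.py.
ANALYSIS_BACKENDS = [
    "will.backends.analysis.nothing",    # no-op backend: skips context analysis
    # "will.backends.analysis.history",  # the backend that seems to spin at 100% CPU
]
```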

dawn-minion commented 6 years ago

@Ashex Thank you for posting this. I hit the exact same issue this morning. What's interesting is that this behaviour is not exhibited locally on my development machine: if I launch it and let it connect to Slack, it never hits this high CPU usage. However, when I deploy on the actual server, it hits 100% immediately after the first message, as you describe.

If I get time I'd like to run a profiler against this section and see what on earth it's doing.

Ashex commented 6 years ago

@dawn-minion Good to know I'm not alone! I have the same situation: no CPU consumption issue on my laptop (OS X High Sierra), only in production. Five days later, will continues to respond to everything instantly with no delay, though that single mystery subprocess still consumes 100% of one core.

Post back with the results of running a profiler against will (I've never done it, so I wouldn't know where to start)!

dawn-minion commented 6 years ago

@Ashex How strange that it works fine locally for you as well. I'm on Ubuntu 18.04 locally and 16.04 on the server. Hopefully I'll have time today to run that, and I'll post the results here too. Thanks again =)

dawn-minion commented 6 years ago

@Ashex So, after a good amount of debugging, I've narrowed it down to this call here: disabling this line (and the rest of the logic) eliminates the abnormal CPU usage, and re-enabling only this line in the plugin makes the 100% CPU lock return.

Just for fun, I tried dumping the contents here, and it was enormous. I don't have an exact count of entries, but it seemed to go on for some time. Looking at the logic, I don't see any limit on how many messages it stores, either.

To test whether this was the cause, I cleared the entries via self.clear("message_history"), restored the history backend to normal, and tried a command against the chatbot. No issues at all. I'm still not exactly sure why the message history was enormous, or why it would have worked fine on HipChat. I'm going to let this run again for a few days and keep an eye on the performance.
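Roughly, what I used was a throwaway plugin along these lines (assuming the usual WillPlugin storage helpers and that the history backend stores the raw list under the "message_history" key; names are illustrative):

```python
# Throwaway diagnostic plugin sketch: report the size of the stored message
# history and clear it on demand. self.load/self.clear/self.reply come from
# will's plugin mixins; adjust key names to whatever your backend actually uses.
from will.plugin import WillPlugin
from will.decorators import respond_to


class HistoryDebugPlugin(WillPlugin):

    @respond_to("^history size$")
    def history_size(self, message):
        history = self.load("message_history", [])
        self.reply("message_history currently holds %d entries." % len(history))

    @respond_to("^history clear$")
    def history_clear(self, message):
        self.clear("message_history")
        self.reply("Cleared message_history.")
```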

Ashex commented 6 years ago

Interesting. I knew it was something in there and only had a hunch about the messages. It's a bit odd because, according to the code, it already slices the list down to 19 elements before appending one.

The thing that bothers me is that there doesn't appear to be anything segmenting the channels, so you potentially just have one long list of messages from different channels.

dawn-minion commented 6 years ago

@Ashex Looking at the code, though, doesn't it only chop the list to 19 elements for the context object, while the size in Redis/storage is unlimited? Only the context object seems to be affected there; the original history object remains unchanged.

If that really comes from all the channels, that would be a truly huge amount of data. It really should be capped per channel it came from, like you say.
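In other words, the suspected pattern is roughly the following (a paraphrase of what the logic appears to do, not the actual backend code):

```python
# Paraphrase of the suspected pattern: the 19-element cap is only applied to
# the copy handed to the context object, while the list written back to
# storage keeps growing without bound.
def update_history(storage, new_message):
    full_history = storage.load("message_history", [])

    # Trimmed copy used for contextual analysis...
    context_history = full_history[-19:]

    # ...but the untrimmed list is appended to and saved back as-is.
    full_history.append(new_message)
    storage.save("message_history", full_history)

    return context_history
```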

Ashex commented 6 years ago

I'm crawling through things, and from what I see, the analysis backend is called via the pubsub backend, which is triggered for each incoming message. Per the docs on the history analysis backend, it captures the last 20 messages so that they can be used for contextual analysis of a command; therefore it needs to record messages from every channel will is a member of.

This is a marginally educated guess, as I haven't a clue how the HipChat backend functions, but we're using the Slack RTM feed, so will is actually processing every message in the Slack workspace, which could mean those are all being added to message_history. I modified my codebase to include a channel ACL that discards messages not from a channel in the ACL list before they're processed; the PR for it is here (a rough sketch of the idea is below). This change could reduce the severity, but it doesn't exactly address the issue of unbounded growth.
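Simplified, the idea in that PR is along these lines (names and the hook point are illustrative, not the actual implementation):

```python
# Illustrative sketch of a channel ACL: drop any incoming RTM event whose
# channel isn't whitelisted before it reaches the rest of the pipeline.
CHANNEL_ACL = {"C012ABCDE", "C0FFEE123"}  # Slack channel IDs will should listen to


def should_process(event):
    channel = event.get("channel")
    # Keep DMs/events without a channel; drop anything from a non-ACL'd channel.
    return channel is None or channel in CHANNEL_ACL


def filter_rtm_event(event, dispatch):
    if should_process(event):
        dispatch(event)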

And you're absolutely right about the history object; I completely missed that. I don't really see where else message_history is used except here. I'll do some tinkering tomorrow and dump message_history to see how far back it goes, along with inspecting the context object.

It seems like sharding by channel within the dictionary would help, as we could then chop that channel's message history to 19 anytime a message from that channel came in (see the sketch below). As it stands, message_history is a gigantic array with no real structure.
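Something like this is what I have in mind (purely illustrative; the key and helper names are made up for the sketch, not will's actual storage layout):

```python
# Shard the stored history by channel and cap each shard, instead of keeping
# one unbounded global list.
MAX_PER_CHANNEL = 20


def record_message(storage, channel_id, new_message):
    history = storage.load("message_history_by_channel", {})
    channel_history = history.setdefault(channel_id, [])
    channel_history.append(new_message)
    # Trim this channel's list independently, so no shard grows without bound.
    history[channel_id] = channel_history[-MAX_PER_CHANNEL:]
    storage.save("message_history_by_channel", history)
    return history[channel_id]
```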

pastorhudson commented 6 years ago

Would it be better to switch to the Slack Events API? That would drastically reduce the amount of data will has to process.
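For context, the Events API is push-based: Slack delivers only the event types the app subscribes to, instead of the client consuming the entire RTM stream. A minimal receiver (not will code, just a sketch of the model; the endpoint path is arbitrary) looks something like:

```python
# Minimal sketch of a Slack Events API receiver, to show the push model.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/slack/events", methods=["POST"])
def slack_events():
    payload = request.get_json()

    # Slack verifies the endpoint once with a url_verification challenge.
    if payload.get("type") == "url_verification":
        return jsonify({"challenge": payload["challenge"]})

    # Real events arrive wrapped in an event_callback envelope.
    if payload.get("type") == "event_callback":
        event = payload.get("event", {})
        # Hand off to the bot's dispatch here.
        print(event.get("type"), event.get("channel"), event.get("text"))

    return "", 200
```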

wontonst commented 5 years ago

#381