sopel-irc / sopel

:robot::speech_balloon: An easy-to-use and highly extensible IRC Bot framework. Formerly Willie.
https://sopel.chat
Other
950 stars 402 forks source link

CPU usage spike when starting up #1862

Closed ralienpp closed 4 years ago

ralienpp commented 4 years ago

Description

I am running Sopel on a fleet of not-so-powered IoT devices. When it starts on RaspberryPi Compute Module 1 (Raspi CM1) it is very noticeable how there is a high CPU usage pattern that continues for ~25s, and then Sopel's footprint declines as the process is mostly idle.

This causes a major problem during system boot, if Sopel starts along with everything else - the system's chip gets very hot and the machine freezes.

Note that the machine has 512MB RAM and a 700Hz CPU, I wouldn't say that's "underpowered", though it is, relative to other RaspberryPis.

Reproduction steps

  1. Set sopel to start automatically via systemd
  2. Reboot the machine

Environment

Notes

Question

While it is something that can be resolved using systemd's delayed start machanisms, I would like to know what is it that Sopel is doing that causes this behaviour, and how can I make it easier.

Having observed the process with strace I noticed that it opens various directories and looks for something (presumably plugins to load).

Would it be possible to simplify Sopel's life a bit, for example by telling it explicitly where to look, and what not to load?

Exirel commented 4 years ago

Having observed the process with strace I noticed that it opens various directories and looks for something (presumably plugins to load).

Would it be possible to simplify Sopel's life a bit, for example by telling it explicitly where to look, and what not to load?

I don't know what you are referring to exactly. It is possible, though, that the Python import mechanism (which is built-in to Python) is causing this since it'll try several things to properly import Python modules (which is what a Sopel plugin is in the end: a Python module), but I've no idea how to confirm or infirm that hypothesis.

Beside that problem and lack of logs, I don't own a RaspberryPi or anything close to its specs, so it's kind of hard to replicate and see it by myself. I don't know if someone on the Sopel team has the resources to do that either.

dgw commented 4 years ago

I don't know if someone on the Sopel team has the resources to do that either.

My dad would probably loan me his old RPi for testing if I asked. But that's beside the point.

@ralienpp Like any other software, we have to draw a line somewhere and say "This is the minimum hardware capability you need to run it." And we're not even running into such a limit—you said the problem only happens when the whole system is stressed from starting a ton of services all at once on boot. It could be thermal throttling (you mentioned the chip gets hot). In that case the solution would be to add a fan/heat-sink—not a code change in Sopel.

Honestly, I'd say to take advantage of systemd's delay-start feature and move on. We're not likely to spend time optimizing Sopel for ultra-low-horsepower systems. I'd certainly review a patch attempting to improve things, but I won't write one myself. I think @Exirel is probably right that this is from some internal Python mechanism we can't control anyway. (Feel free to submit detailed logs in case someone gets bored and wants to review them, though. It could be something we're doing.)

ralienpp commented 4 years ago

My intention is to attempt to resolve this issue myself, so I am just asking questions to get a better idea of the startup process.

I doubt that this is a generic problem with how Python's module import works because:

I would have provided the logs if I could. The problem is that a RasPi compute module has embedded memory (not a removable flash card); once the image is uploaded into it, I power up the board and it freezes during the boot process. I can't get the log out, nor can I access its memory in some other way. So my debugging means are rather limited here; I can only re-flash another image with Sopel disabled, and try something new.

You're right about the delayed start workaround and I will apply it as a last resort. However, while 700Mhz with 512MB RAM is not a powerful machine, it is definitely not a "Tamagotchi-level" device :-) Heck, we've landed on the moon using a fraction of this computing power! The ice-caps are melting, so it is my moral duty to at least try and find some low-hanging fruits before applying the workaround.

It would be very helpful to me if you could guide me in this process a bit, providing an outline of Sopel's startup routines.

  1. Do you have any suspects when it comes to what might be causing the issue? (even a wild guess would suffice)
  2. What does Sopel do, besides traversing some directories and looking for plugins to load?
  3. Is there something in Sopel's 6.x->7.x transition that changes the way the bot starts?
dgw commented 4 years ago

I guess running on esoteric hardware brings esoteric problems. 😁

On our IRC channel, I speculated a bit with @Exirel. We concluded that if it's not Python's own module search (your observations say that's unlikely), the only other suspect we could think of is the code to find available plugins, located in sopel/plugins/__init__.py. There are a fair few syscalls in some of those routines, and we're certainly open to e.g. alternative ways of listing directories that won't hammer the disk so much.

Other subsystems (like the IRC connection handling) received substantial changes between 6.5.x and 7.0.x, but I think the plugin searching code is definitely your prime suspect—and should be easily confirmed by noting down some of the specific directories you're seeing accesses for in strace.

Exirel commented 4 years ago

I would have provided the logs if I could.

Ahem. So you're saying that you ran a strace on the process... when exactly? And how is that strace relevant to the problem then? I mean, I'm all curious and all for this kind of problem because usually it makes me learn stuff about Python, which I consider a good thing. Also it can solve problems, so, again, good. But I don't really understand how you make your hypothesis here.

Now, for what you ask, what I know is that the setup phase is called here: https://github.com/sopel-irc/sopel/blob/master/sopel/cli/run.py#L77-L88

Then, said setup function is defined here: https://github.com/sopel-irc/sopel/blob/25f1bea4303585433bdb4a3ccbc884d7924562b0/sopel/bot.py#L153

This one does 3 things:

  1. setup the logging
  2. setup the plugins (find, load, and setup them)
  3. start a scheduler

Once all of these are setup, the bot can be started over here: https://github.com/sopel-irc/sopel/blob/25f1bea4303585433bdb4a3ccbc884d7924562b0/sopel/irc/__init__.py#L169-L184

It doesn't do much: get the backend class/object, initiate the connection, and then run forever, waiting for a connection.

By the way, maybe you want to make sure that your systemd service is started only once the network is available, or you might totally have problems here.

Heck, we've landed on the moon using a fraction of this computing power!

You forgot the huge team and super crazy talented people around it. Also, trying to guilt trip people is not my cup of tea.

dgw commented 4 years ago

Turns out my dad was all too happy for me to borrow his RPi. It's one of the earliest versions, labeled 2011.12 on the PCB; lscpu reports 700 MHz and free -mh reports total 432Mi of RAM. (I guess some is reserved for video, since I have a monitor connected.)

I used a Raspbian 10 image and installed Sopel under Python 3.7. I can't reproduce any apparent hanging during startup when Sopel is enabled as a systemd service (using our contrib template). The output scrolling by on my display doesn't noticeably pause more than before I enabled it.

Obviously this isn't exactly your setup (slightly different hardware, slightly different OS environment), but it's close enough that I wonder what other confounding factors are present on your CM1 devices. Clearly a 700 MHz single-core RPi should be able to handle Sopel starting just fine.

ralienpp commented 4 years ago

Thank you all for the quick and thoughtful responses, I greatly appreciate your support. Having considered this problem for a while, here are some observations I would like to share:

First of all, you can see a screen capture of a terminal where the problem manifests itself: 1 minute, https://youtu.be/RTLjUeOE2yM (Sopel 7.0.3 is installed in a Python3 virtualenv without any custom plugins). This runs on a Raspbian 10, on a RaspberryPi Zero, also with a 700 MHz processor and 512 MB RAM.

There are several windows multiplexed with byobu (it is a more user-friendly analog of tmux or screen), their titles are at the bottom; I'm switching between them to show the output of htop in one window and the state of sopel in another. I know this is not very visually exciting :-) , but I hope it illustrates the problem at hand.

Second, I ran a similar experiment with another Python program that starts the interpreter, connects to a server and continuously sends temperature and humidity telemetry that it reads from a hardware sensor. The transmission is done using a protocol called MQTT. When I run this program I see a short spike in CPU use, so it gets back to normal rather quickly. I have a video recording of that too, but it is even less exciting. My conclusion is that it is not the Python interpreter itself that causes it, but rather the code that it executes.

My third step was to apply some profiling techniques and see what function calls the process spends most of its time in. To this end, I patched Sopel, as follows: image

I then ran it and waited for it to successfully connect to the server, at that point I press Ctrl+C to end it. Then I inspected the profiler's data using a nifty tool called snakeviz (pip install snakeviz); you run it as snakeviz -H 0.0.0.0 --server chatperf.profile and then observe its output with a browser.

image

I was expecting a great deal of os.listdir or something like that - but that wasn't the case (there were only 18 calls overall, and they lasted a total of ~0.0006s). Here are the most frequently called functions: image

Here's a similar list, sorted by total time spent in function. Note that the greatest "offender" is taking ~6s and it is select - so I presume this already happens when Sopel connects to the server (at that stage the CPU use is back to low levels). In the video recording you can see that this whole thing takes ~30s - but that is not reflected in this profiling data: image

Of course, there's context switching, and the OS is doing something else too, but my gut feeling tells me that I am overlooking some major factor in this observation. My guess is that there must be a better place to start the profiler's probing. For example, I am assuming that modules are imported before run is invoked, and maybe that's where all the action takes place. I will keep tinkering with it tomorrow. If the materials above create some new ideas in your mind, please feel free to share them.

Now, to address some of the other questions and remarks.

Ahem. So you're saying that you ran a strace on the process... when exactly?

I did it in 2 ways:

And how is that strace relevant to the problem then?

My intention was to get some insights into the mental state of Sopel's process. Was it "busy-waiting", was it computing the 9034581293416th digit of pi, or was it idling inside a select call, or perhaps calling the same function in a loop?

At this stage I am not yet informed enough to see something suspicious in that output, but I have to start with something.

By the way, maybe you want to make sure that your systemd service is started.

That is already the case.

You forgot the huge team and super crazy talented people around it.

Aren't we talented too? As Sopel is not at a comparable level of complexity, perhaps even normal people like us have a shot :-)

Also, trying to guilt trip people is not my cup of tea.

I am sorry if that is what you saw in my message. The meaning I invested into it is that energy is a finite resource, and it would be wise to use it sparingly, otherwise we'll end up in the "sad end" of a "tragedy of the commons" (https://www.youtube.com/watch?v=CxC161GvMPc). Note that I wrote "my" moral duty, not "yours" or "our". My gut feeling tells me that there's something odd about it, the least I can do is ask a few questions, set some breakpoints and look at some logs. Who knows - maybe there will be an easy solution. I did not imply that I expect others to immediately focus on this problem, which so far only affects but me.

I am extremely grateful for the time you've already spent pondering over this problem, and before that - for making Sopel what it already is. Anything on top of that is a nice bonus, but I totally understand that other people's priorities are not necessarily aligned with my own.

dgw commented 4 years ago

We don't have a fancy bot to close inactive issues (and thank goodness for that; it would have carpet-bombed the issue tracker during the couple years when the project was dormant), but I'm happy to do it manually.

We can keep discussing this whenever new info or new theories come up, but unless this discussion identifies a specific thing we can fix in Sopel it's not productive to keep it in the list of open issues. 😸