pharo-project / pharo-vm

This is the VM used by Pharo
http://pharo.org

VM goes into 100% CPU utilization #828

Open noha opened 3 days ago

noha commented 3 days ago

Since last week I have noticed on our production system that each morning one to four VMs are using 100% CPU. The timing matches the release of VM 1.2.1 plus the few days it took us to make a new release that includes it. I searched the process list in the image for anything strange but could only find normal processes in normal numbers. I then made our deployment able to go back to an earlier VM, moved our system back to the 1.2.0 VM, and the effect disappeared. I have watched this for a few days now: before, there were VMs at 100% CPU every day, and now there are none.

Sadly, this is the only context I can provide at this point. To us, the introduction of epoll() looks like the most likely culprit. The images most probably go into 100% CPU mode at night, because that is when the backup runs happen, with a high I/O load. I will try to preserve a manually triggered crash.dmp file. This needs a bit of effort due to the ephemeral nature of Docker containers. A crash.dmp is the best bet for finding something, as I don't know how to reproduce the problem. So if you have better ideas about what data I should collect, please tell me.

tesonep commented 4 hours ago

Hi @noha, you are saying that the problem occurs when the backup is launched. What is the backup process doing? Is the VM handling connections during that time, or is it idle?

noha commented 2 hours ago

@tesonep The backup copies the contents of a database to another location (it is a visitor that traverses one database and copies over all the blobs it finds). So it is just reading and writing a lot, but all within Pharo using file streams (and calling FFI). The VM is handling connections at the same time. Except for one database instance, simultaneous connections are unlikely. For that one database, it is almost certain there are connections while it is being backed up. From the application side this is not an issue. Do you expect problems when things happen at the same time?

btw. I'm using p11