phetsims / aqua

Automatic QUality Assurance
MIT License

Improve continuous server error handling #146

Closed: zepumph closed this issue 2 years ago

zepumph commented 2 years ago

It looks like ContinuousServer has been running on bayes for the last 26 hours, and in that time it has restarted 531 times. It would be good to improve the error handling instead of relying on pm2 to restart it. My guess is that this will also make CT faster, since it won't have to do the stale check on each restart and potentially rerun command-line, non-browser tests.

┌─────┬────────────────────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id  │ name                       │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├─────┼────────────────────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ continuous-server          │ default     │ N/A     │ fork    │ 10202    │ 26h    │ 531  │ online    │ 0%       │ 506.3mb  │ phe… │ disabled │
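
As a rough illustration of handling errors inside the process rather than leaving recovery to pm2, here is a minimal sketch. It is not taken from aqua: `runGuarded`, `runAllGuarded`, and `Logger` are hypothetical names, and the tasks are assumed to be synchronous for simplicity (an async variant would `await` the task inside the same `try`/`catch`).

```typescript
// Hypothetical sketch: none of these names come from aqua's ContinuousServer.
type Logger = (message: string) => void;

// Run one unit of work. On failure, log the error and return false instead of
// letting the exception propagate and terminate the process.
function runGuarded(label: string, task: () => void, log: Logger): boolean {
  try {
    task();
    return true;
  } catch (e) {
    const detail = e instanceof Error ? e.message : String(e);
    log(`${label} failed: ${detail}`);
    return false;
  }
}

// Run every task in order, continuing past failures.
// Returns the number of tasks that failed.
function runAllGuarded(tasks: Array<[string, () => void]>, log: Logger): number {
  let failures = 0;
  for (const [label, task] of tasks) {
    if (!runGuarded(label, task, log)) {
      failures++;
    }
  }
  return failures;
}
```

With this shape, a single failing task is logged and counted while the remaining tasks still run, so the process itself stays up and the failure is visible in its own log rather than only in the pm2 restart counter.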

zepumph commented 2 years ago

Unfortunately, this doesn't seem to be showing up in the logs for the continuous-server process. The next step would be to check the pm2 logs themselves to see if there are errors about continuous-server crashing.

zepumph commented 2 years ago

In .pm2/pm2.log, I see it noted, but without much help:

2022-06-08T13:23:26: PM2 log: Stopping app:continuous-server id:1
2022-06-08T13:23:26: PM2 log: pid=9549 msg=failed to kill - retrying in 100ms
2022-06-08T13:23:26: PM2 log: App [continuous-server:1] exited with code [3] via signal [SIGINT]
2022-06-08T13:23:26: PM2 log: pid=9549 msg=process killed

zepumph commented 2 years ago

Ahh! I think I actually understand. Right when I started the service 26 hours ago, there was an infinite loop of failures because we needed to run npm install in aqua/. After doing that, there hasn't been a crash, which is why the uptime is 26 hours. If it had been restarting often, its uptime would be much less than that of the other processes, which were all started 26 hours ago. Closing.
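
The uptime reasoning above can be written down as a small check. This is only a sketch under stated assumptions: `ProcessRow` is a hypothetical, simplified view of one row of the pm2 table (not pm2's actual data format), the peer process name is made up, and the `tolerance` factor is an arbitrary choice.

```typescript
// Hypothetical, simplified view of one row of the pm2 process table.
interface ProcessRow {
  name: string;
  uptimeHours: number; // time since the current instance started
  restarts: number;    // the ↺ column: cumulative restart count
}

// A high restart count alone does not mean the process is still crashing.
// Flag it only if it has restarts AND its uptime is clearly shorter than the
// longest-running process started alongside it.
function isStillRestarting(
  row: ProcessRow,
  peers: ProcessRow[],
  tolerance: number = 0.9 // assumed threshold, not from the thread
): boolean {
  const longest = Math.max(row.uptimeHours, ...peers.map((p) => p.uptimeHours));
  return row.restarts > 0 && row.uptimeHours < longest * tolerance;
}

// Values from the table in this thread; the peer row is a made-up example.
const server: ProcessRow = { name: "continuous-server", uptimeHours: 26, restarts: 531 };
const peers: ProcessRow[] = [{ name: "some-other-process", uptimeHours: 26, restarts: 0 }];
const stillRestarting = isStillRestarting(server, peers);
```

Here `stillRestarting` is `false`: the 531 restarts all happened before the current 26-hour run began, which matches the conclusion above.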