radiasoft / sirepo


job_agent sbatch should query job statuses on startup #6916

Open robnagler opened 8 months ago

robnagler commented 8 months ago

The sbatch_id for all jobs known to be running for that user should be saved in the supervisor db. There needs to be a query option for finding all "uncertain outcome" sbatch ids. Then sacct should be called to figure out what happened, and the result should be logged in the database.
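
A minimal sketch of the sacct step, not Sirepo's actual implementation. It assumes the "uncertain outcome" ids have already been read from the supervisor db as strings. The names `parse_sacct`, `query_uncertain`, and `is_finished` are hypothetical, as is the choice of which Slurm states to treat as terminal. The sacct options used (`--allocations`, `--noheader`, `--parsable2`, `--format`, `--jobs`) are standard.

```python
import subprocess

# Assumption: states after which the job is treated as finished.
_TERMINAL = frozenset({
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE", "FAILED",
    "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
})


def parse_sacct(text):
    """Map job id to (state, exit_code) from `sacct --parsable2` output."""
    res = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state, exit_code = line.split("|")[:3]
        # sacct reports "CANCELLED by <uid>"; keep only the state name
        words = state.split()
        res[job_id] = (words[0] if words else "UNKNOWN", exit_code)
    return res


def query_uncertain(sbatch_ids, run=subprocess.run):
    """Ask sacct what happened to each id; ids sacct omits are UNKNOWN."""
    if not sbatch_ids:
        return {}
    p = run(
        [
            "sacct",
            "--allocations",  # one row per job, no job steps
            "--noheader",
            "--parsable2",  # "|"-delimited, no trailing delimiter
            "--format=JobID,State,ExitCode",
            "--jobs",
            ",".join(sbatch_ids),
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    found = parse_sacct(p.stdout)
    return {i: found.get(i, ("UNKNOWN", "")) for i in sbatch_ids}


def is_finished(state):
    """True if the job will not change state again (under the assumption above)."""
    return state in _TERMINAL
```

On startup the agent could call `query_uncertain` once with every saved id and write each `(state, exit_code)` pair back to the db, leaving jobs that are still RUNNING or PENDING for normal polling.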

Prereq: https://github.com/radiasoft/sirepo/issues/6914

robnagler commented 6 months ago

This came up again so please make this a priority.

robnagler commented 1 week ago

When the server restarts, the UI shows canceled, but this isn't necessarily true. It's an assumption in the UI.

We should probably put up a message to tell the user to refresh, or try to refresh automatically when the connection comes back up.

robnagler commented 1 week ago

The UI needs to coordinate sbatchLogin better:

20:17:35 api_runStatus
20:17:35 api_sbatchLoginStatus
20:17:35 api_simulationFrame
20:17:35 SRException sbatchLogin
20:17:40 api_runStatus
20:17:44 api_sbatchLogin
20:17:50 api_runStatus

The first login succeeds, but there's a second prompt for a login so the screen looks like (two levels of alpha):

[Image: two-logins]

State machine needs to protect itself.

robnagler commented 1 week ago

> State machine needs to protect itself.

The state machine needed to ignore some events such as an srException when already prompting for creds.
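
A minimal sketch of that kind of guard, written in Python for illustration only (it is not the UI code). The states, event names, and the `LoginPrompt` class are hypothetical. The idea is that only transitions listed in the table are accepted, so a second srException arriving while a prompt is already open is dropped instead of opening another dialog.

```python
import enum


class State(enum.Enum):
    IDLE = "idle"
    PROMPTING = "prompting"
    LOGGING_IN = "logging_in"


class LoginPrompt:
    # Only these (state, event) pairs cause a transition; all others are ignored.
    _TRANSITIONS = {
        (State.IDLE, "sr_exception"): State.PROMPTING,
        (State.PROMPTING, "submit"): State.LOGGING_IN,
        (State.LOGGING_IN, "login_ok"): State.IDLE,
        (State.LOGGING_IN, "login_failed"): State.PROMPTING,
    }

    def __init__(self):
        self.state = State.IDLE
        self.prompts_opened = 0
        self.ignored = []

    def handle(self, event):
        """Apply event; return False if it is ignored in the current state."""
        new = self._TRANSITIONS.get((self.state, event))
        if new is None:
            # e.g. an srException while already prompting for creds
            self.ignored.append((self.state, event))
            return False
        if self.state is State.IDLE and new is State.PROMPTING:
            # a new dialog is opened only when leaving IDLE
            self.prompts_opened += 1
        self.state = new
        return True
```

Replaying a sequence like the one logged above (an srException, a second srException, the login submit, a late srException, then success) opens a single prompt and ends in `State.IDLE`, with the two extra exceptions recorded in `ignored`.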