radiasoft / sirepo

Sirepo is a framework for scientific cloud computing. Try it out!
https://sirepo.com
Apache License 2.0

Sbatch: provide a specific message when NERSC is down for maintenance #7041

Closed e-carlin closed 6 months ago

e-carlin commented 6 months ago

5/15 is the next maintenance period. Figure out if there is a specific error that comes back through the job_driver.

e-carlin commented 6 months ago

Perlmutter is currently down. I tried to ssh in and got a connection timeout error, with no specific message that it is down for maintenance. I also tried calling the NERSC API, which should be able to return the status, but I got a 500 back. So it may be that there isn't a reliable, programmatic way to tell that NERSC is down for maintenance, or maybe today it is more messed up than normal.

I think if there is a connection error we should link users to the status page (which doesn't call the public status API) and tell them they can reach out to us. We don't need to separate the case where the error was on our end from the case where it was on NERSC's end. It's too hard to tell which is which.

robnagler commented 6 months ago

This would work: curl -s -S https://www.nersc.gov/live-status/motd/ | grep -i -q 'perlmutter.*down'
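For reference, here is a rough Python equivalent of that pipeline, as a sketch only. The names `motd_reports_down` and `fetch_motd` are made up for illustration and are not part of Sirepo. The pattern and URL are the ones from the command above. Whether the MOTD page keeps that wording is an assumption.

```python
import re
import urllib.request

# URL and pattern are taken from the curl | grep command in this thread.
MOTD_URL = "https://www.nersc.gov/live-status/motd/"
_DOWN_RE = re.compile(r"perlmutter.*down", re.IGNORECASE)


def motd_reports_down(motd_text):
    """Return True if any single line matches, like `grep -i -q`.

    `.` does not match a newline by default, so the match stays
    within one line, which is how grep behaves.
    """
    return _DOWN_RE.search(motd_text) is not None


def fetch_motd(url=MOTD_URL, timeout=10):
    """Hypothetical helper: download the page body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A caller would run `motd_reports_down(fetch_motd())` and treat any exception from the fetch as "status unknown".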

e-carlin commented 6 months ago

Do you not worry about that being fragile? perlmutter.*down could easily appear in the planned outages section (HPSS.*down currently does).
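To make the concern concrete, here is a small sketch. The sample text and the "Planned outages" heading are invented for illustration and are not the real MOTD page layout. It shows the pattern matching a planned-outage line even though the current status says up, and one possible way to narrow the check.

```python
import re

# Invented sample text; the real page layout may differ.
SAMPLE = """\
Current status
Perlmutter: up
Planned outages
Perlmutter will be down for scheduled maintenance
"""

_DOWN_RE = re.compile(r"perlmutter.*down", re.IGNORECASE)


def naive_check(text):
    # Same test as grep -i -q 'perlmutter.*down' over the whole page.
    return _DOWN_RE.search(text) is not None


def narrowed_check(text, marker="Planned outages"):
    # Only look at text before the (assumed) planned-outages heading.
    return _DOWN_RE.search(text.split(marker, 1)[0]) is not None
```

With this sample, `naive_check(SAMPLE)` is true while `narrowed_check(SAMPLE)` is false. The narrowed version still depends on the heading text staying the same.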

Furthermore, after looking at this issue, solving it seems like overkill. Do we really need a different message when NERSC is down vs. some other error? I think a message like this would cover all cases:

Unable to connect. Perlmutter may be down <status-link>. Please contact support@sirepo.com if the issue persists.

I find it hard to justify introducing code to add/eliminate the middle sentence in that message.
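A minimal sketch of the single-message approach, assuming the connection attempt raises an `OSError` subclass (such as `TimeoutError` or `ConnectionError`) on failure. The function names are hypothetical and not from the Sirepo code base, and the status link is passed in rather than hard-coded.

```python
def connection_error_message(status_link):
    # One message for every connection failure, as proposed above.
    return (
        f"Unable to connect. Perlmutter may be down {status_link}. "
        "Please contact support@sirepo.com if the issue persists."
    )


def run_with_user_message(connect, status_link):
    """Call connect(); return (result, None) or (None, user_message).

    No attempt is made to tell an error on our end from one on NERSC's.
    """
    try:
        return connect(), None
    except OSError:
        # TimeoutError and ConnectionError are both OSError subclasses.
        return None, connection_error_message(status_link)
```

The caller shows the second element to the user whenever it is not `None`.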

robnagler commented 6 months ago

I don't worry about fragility. It would only be checked when we fail to connect so it's likely it's down, just verifying.

I agree that there's no need for the extra check.

I was just addressing the public API. If there's a website, it's public and it's an API. As you saw, even when there is a public API, it's fragile. Indeed, I would argue APIs are often more fragile than traditional HTML pages (I'm not talking about scraping JavaScript). People often say "don't parse HTML", and I say, well, HTML is easy to parse, often easier than complicated public APIs.