Closed e-carlin closed 6 months ago
Perlmutter is currently down. I tried to ssh in and got a connection timeout error, with no specific message that it is down for maintenance. I also tried calling the NERSC API, which should be able to return the status, but I got a 500 back. So it may be that there isn't a reliable, programmatic way to tell that NERSC is down for maintenance, or maybe today it is more messed up than normal.
I think if there is a connection error we should link users to the status page (which doesn't call the public status API) and tell them they can reach out to us. We don't need to distinguish errors on our end from errors on NERSC's end; it's too hard to tell which is which.
This would work: curl -s -S https://www.nersc.gov/live-status/motd/ | grep -i -q 'perlmutter.*down'
Do you not worry about that being fragile? perlmutter.*down could easily appear in the planned outages section (HPSS.*down currently does).
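For reference, here is a rough Python sketch of the same check that also addresses the planned-outages concern by ignoring everything after a marker line. The function names, the `stop_marker` parameter, and the idea that the planned outages section starts with a recognizable heading are all assumptions for illustration; the actual layout of the MOTD page has not been verified here.

```python
import re
import urllib.request

MOTD_URL = "https://www.nersc.gov/live-status/motd/"
_DOWN = re.compile(r"perlmutter.*down", re.IGNORECASE)


def mentions_perlmutter_down(text, stop_marker=None):
    """Line-by-line equivalent of `grep -i -q 'perlmutter.*down'`.

    If stop_marker is given (hypothetical: whatever text introduces the
    planned outages section), scanning stops at the first line that
    contains it, so later matches are ignored.
    """
    for line in text.splitlines():
        if stop_marker and stop_marker.lower() in line.lower():
            break
        if _DOWN.search(line):
            return True
    return False


def fetch_motd(url=MOTD_URL, timeout=10):
    """Download the status page; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Keeping the match separate from the download means the matching logic can be exercised on saved page text. The check is still only as good as the marker chosen, so it narrows the false positives rather than removing them.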
Furthermore, after looking at this issue, solving it seems like overkill. Do we really need a different message when NERSC is down vs. some other error? I think a message like this would cover all cases:
Unable to connect. Perlmutter may be down <status-link>. Please contact support@sirepo.com if the issue persists.
I find it hard to justify introducing code to add/eliminate the middle sentence in that message.
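As a minimal sketch of the single-message approach, something like the following would be enough. The function name and the `status_url` parameter are hypothetical; the status link itself is left as a parameter because it is not spelled out in this thread.

```python
SUPPORT_EMAIL = "support@sirepo.com"


def connection_error_message(status_url):
    """Build one catch-all message for any connection failure.

    No attempt is made to tell a NERSC outage from an error on our end.
    """
    return (
        f"Unable to connect. Perlmutter may be down {status_url}. "
        f"Please contact {SUPPORT_EMAIL} if the issue persists."
    )
```

Because the message never changes with the cause, there is no branch to test or maintain beyond the connection failure itself.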
I don't worry about fragility. It would only be checked when we fail to connect so it's likely it's down, just verifying.
I agree that there's no need for the extra check.
I was just addressing the public API. If there's a website, it's public and an API. As you saw, even when there is a public API, it's fragile. Indeed, I would argue APIs are often more fragile than traditional HTML pages (I'm not talking about scraping JavaScript). People often say "don't parse HTML", and I say, well, HTML is easy to parse, often easier than complicated public APIs.
5/15 is the next maintenance period. Figure out if there is a specific error that comes back through the job_driver.