nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Test System board fault on wrk-99 #492

Closed by aabaris 1 month ago

aabaris commented 3 months ago

Lenovo Node SD650-N V2 reports: system board voltage fault
Location: R8-PA-C23 U38
OBM: 10.30.0.136
Serial number: J70159YT

joachimweyl commented 3 months ago

@hakasapl do we have communication with Lenovo on repairing/replacing this GPU Node?

joachimweyl commented 3 months ago

@hakasapl please create a new issue if we need to work with Lenovo. This should be closed as soon as we know if we can fix it ourselves.

hakasapl commented 3 months ago

Reseated, awaiting testing

aabaris commented 3 months ago

The reseat initially cleared the fault, but the system board fault re-asserted when we tried to power the system on. I will open a Lenovo support case.

aabaris commented 3 months ago

Lenovo support case #3000351970

aabaris commented 3 months ago

Lenovo will send a technician to replace the system board.

aabaris commented 3 months ago

Replacing the system board did not remedy the problem. The Lenovo technician ordered a new CPU and another system board, and will attempt to repair the server when the parts arrive at MGHPCC.

joachimweyl commented 2 months ago

Any update from Lenovo on when they plan to fix this?

aabaris commented 2 months ago

The server started exhibiting the same or similar problems (system board fault and refusal to power on).

I opened a new Lenovo repair case, #3000361273 (the original case was #3000351970). Strangely, I could not find a way to add a backup contact for the ticket; I will work with Lenovo support to figure out how that can be done.

hakasapl commented 1 month ago

@aabaris is this node fixed now?

aabaris commented 1 month ago

> @aabaris is this node fixed now?

Yes, I believe the last round of parts replacements successfully repaired the system.

@joachimweyl created a ticket for adding this node to production, https://github.com/nerc-project/operations/issues/557, though this might currently be blocked by the NERC change freeze.