aabaris closed this issue 1 month ago.
@hakasapl do we have communication with Lenovo on repairing/replacing this GPU Node?
@hakasapl please create a new issue if we need to work with Lenovo. This one should be closed as soon as we know whether we can fix it ourselves.
Reseated, awaiting testing
The reseat initially cleared the fault, but the system board fault re-asserted when I tried to power the system on. I will open a Lenovo support case.
Lenovo support case #3000351970
Lenovo will send a technician to replace the system board.
Replacing the system board did not remedy the problem. The Lenovo technician ordered a new CPU and another system board and will attempt to repair the server when the parts arrive at MGHPCC.
Any update from Lenovo on when they plan to fix this?
The server started exhibiting the same or similar problems (system board fault and refusal to power on).
I opened a new Lenovo repair case, #3000361273 (the original case was #3000351970). Strangely, I could not find a way to add a backup contact to the ticket; I will work with Lenovo support to figure out how that can be done.
@aabaris is this node fixed now?
Yes, I believe the last round of parts replacements successfully repaired the system.
@joachimweyl created a ticket for adding this node to production: https://github.com/nerc-project/operations/issues/557. It may currently be blocked by the NERC change freeze.
Lenovo node SD650-N V2 reports: system board voltage fault
Location: R8-PA-C23 U38
OBM: 10.30.0.136
Serial number: J70159YT