nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Request for on-site help troubleshooting down ethernet links (wrk-88) #792

Open aabaris opened 4 weeks ago

aabaris commented 4 weeks ago

We lost connectivity to wrk-88 server.

We would like to rule out low level issues by

Server: wrk-88 (Dell R740xd)
Serial Number: 4VZ81S2
Location: R2-PC-C6 - U33

Please verify the server serial number in addition to its rack location. I believe the rack is accessible via the MOC keyset.

Thank you.

Augustine

dclyon commented 4 weeks ago

Hi Augustine,

Upon inspection, the twinax device ports were dark and switch-side ports were blinking amber.

After re-seating switch-side, both the device and switch ports are signaling green and appear to have link.

Also, I can confirm that the R740xd located in R2-PC-C06 U33 is indeed machine 4VZ81S2.

I was able to access this cabinet with the NERC keyset.

Please let me know if you have any questions or if I can help further. I will send an email with this response as well.

Cheers,

Chase

joachimweyl commented 1 week ago

@aabaris or @hakasapl can we confirm this is now resolved?

aabaris commented 1 week ago

This was not resolved. Cable re-seat brought links back up but only temporarily.

Possible next steps:

1. Replace the cables: it is a bit strange that both interfaces would be affected by a bad cable, but low-level issues are possible. Do we have a supply of replacement twinax cables to try this?

2. Replace the ethernet card: this system is not under warranty, and I am not aware of us having spare parts for it, since we put all the V100 nodes into use.

I took another look at option 2. Ethernet is handled by the onboard Broadcom Adv. Dual 10Gb Ethernet. It is not a PCI card, which makes a component swap more complicated and may require a whole motherboard swap.

aabaris commented 1 week ago

Apologies for not noticing this sooner (today I have a lot of multi-tasking going on).

While the initial re-seat did not fix the network links, when I re-checked the node's status today, its networking appears to be working just fine. Both NICs report a 10Gb link and the server is reachable in OpenShift. I don't like problems that resolve without an explanation, but at the moment it appears that there is nothing to fix.
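
Since the links came back without an explanation, it may be worth watching for a recurrence. Below is a minimal sketch, not something that was run on wrk-88, that reads link state from the standard Linux sysfs files `operstate`, `speed` and `carrier_changes` under `/sys/class/net/<iface>/`. The interface names `eno1` and `eno2`, the function names, and the 10000 Mb/s expectation are assumptions for illustration; substitute whatever the node actually uses.

```python
from pathlib import Path


def read_link_state(iface, sysfs_root="/sys/class/net"):
    """Return operstate, speed (Mb/s) and carrier_changes for one interface."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            # File is missing, or the read failed (reading "speed" can
            # fail on some drivers while the link is down).
            return None

    speed = read("speed")
    changes = read("carrier_changes")
    return {
        "iface": iface,
        "operstate": read("operstate"),
        "speed_mbps": int(speed) if speed is not None else None,
        "carrier_changes": int(changes) if changes is not None else None,
    }


def link_is_healthy(state, expected_mbps=10000):
    """True when the interface is up at the expected speed."""
    return state["operstate"] == "up" and state["speed_mbps"] == expected_mbps


if __name__ == "__main__":
    # Hypothetical interface names; adjust to the node's actual NICs.
    for name in ("eno1", "eno2"):
        state = read_link_state(name)
        print(state, "healthy" if link_is_healthy(state) else "CHECK")
```

Running this periodically and comparing `carrier_changes` between runs would show whether the links are flapping again, since that counter increases each time the carrier goes up or down.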