nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

No link on wrk-83, wrk-84 #313

Closed larsks closed 8 months ago

larsks commented 10 months ago

Nodes in chassis located at: R1-PC-C20 U27 have no network links.

There are 2 network pass-through modules in the back of this chassis, identified as A1 and A2 in the mapping here: https://www.dell.com/support/manuals/en-us/poweredge-fx2/fx2ownersmanual/io-module-port-mapping%E2%80%94eight-bay-chassis?guid=guid-626ab443-5844-4c0b-abd6-aea3bc7b3e18&lang=en-us

Both of these modules should have all ports cabled to a top-of-rack switch or switches. We need someone to investigate their cabling status: identify whether the cables are present and, if so, which switch(es) they are plugged into.

aabaris commented 10 months ago

A virtual re-seat did not fix these nodes. After investigating the issue via CMC and iDRAC OBM access, I believe we've exhausted what can be done remotely.

Even though I am going to MGHPCC on Monday 2023-11-20, I do not have key access to R1-PC-C20. Furthermore, even with key access I might not have the resources to fix it (say, if a cable is missing or a broken piece of HW is involved).

There is also an issue about a missing link on other nodes in this same chassis: https://github.com/nerc-project/operations/issues/220

Also, less urgent, there are more HW issues in that same cabinet, described here: https://github.com/nerc-project/operations/issues/218

@joachimweyl could we ask for help? I'd like to help, but with no keys, no parts, and a long list of other tasks already planned for this trip, I'm not in a good position to solve this problem.

hakasapl commented 10 months ago

The plan is for techsquare to take a look at this. Conversation from Slack:

Hakan Saplakoglu
Can you indicate the U number of the FX2 and the port on the chassis switch that is reporting down (or the slot of the server)? That way we can just hand the ticket over to techsquare to look. They don’t have any info about what wrk-## translates to.

Hakan Saplakoglu
Usually for techsquare I update the description of the issue with the request, so if we could update the first post that would be great

larsks
If we've been updating the CMCs with hostnames we can probably do that...or if there's a way to get the chassis name and slot from the idrac. I'll take a look.

augustin
Regarding powering on the other nodes, I'm a bit apprehensive about just doing that now. I don't have a lot of cycles to watch them and to watch out for whatever these devices might boot, etc. If you feel like you can do this test and immediately power off the nodes that were originally off, that would work.

augustin
I don't have the CMC U locations. When I looked at the locked cabinet after they were moved to 1-c-20, I noticed the bottom Us were empty?

Hakan Saplakoglu
They were empty in the old location too, the rack was just swapped so the Us are the same so I think this file still applies, just a different cabinet

augustin
I have some data for correlating CMC slots with FX nodes that I'll pull together.
I tried the Redfish and Ansible OMSA modules on the CMCs but ran into old HTTPS issues. I just have racadm CLI dumps that I can correlate with regexps.
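
The regexp correlation mentioned above could look roughly like the sketch below. The dump format, the hostnames, and the helper name are all hypothetical placeholders. Real racadm output looks different, so the pattern would need to be adapted to the actual dumps.

```python
import re

# HYPOTHETICAL dump format, for illustration only. Real racadm output
# differs, so the pattern below must be adapted to the actual dumps.
SAMPLE_DUMP = """\
slot=1 hostname=wrk-example-a
slot=2 hostname=wrk-example-b
"""

# One line per server slot: slot number followed by the hostname.
SLOT_LINE = re.compile(
    r"^slot=(?P<slot>\d+)\s+hostname=(?P<host>\S+)$", re.MULTILINE
)


def slots_by_hostname(dump: str) -> dict[str, int]:
    """Map each hostname found in the dump to its chassis slot number."""
    return {m["host"]: int(m["slot"]) for m in SLOT_LINE.finditer(dump)}


print(slots_by_hostname(SAMPLE_DUMP))
```

A mapping like this is what would let a wrk-## hostname be translated into a chassis and slot for someone standing at the rack.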

larsks
@Hakan Saplakoglu we could illuminate the indicator LED on systems that need some TLC, maybe?

Hakan Saplakoglu
Do we know if the LED stays blinking until it’s turned off, or is it temporary?

augustin
I should be able to provide a map for techsquare; I just need a bit of time to pull this info together (some of it requires referencing the manual to be accurate about the pass-through module ports).

larsks
I believe the LED will stay illuminated until it is explicitly disabled (but I'm not 100% positive).

augustin
I will pull together location info and details about the known network link problems, so that a request can be sent to techsquare. I can't context switch to it at this moment, but I have enough data to tie all this together, so the issue could be worked on next week.

augustin
I'll need a bit of info on how to engage techsquare, but I'm not blocked on it as I need to gather aforementioned details first.

Hakan Saplakoglu
I can let them know, I usually just add a label on the issue and then email the techsquare team. They’ll then update the issue

aabaris commented 10 months ago

Nodes in chassis located at: R1-PC-C20 U27 have no network links.

There are 2 network passthrough modules in the back of this chassis, identified as A1 and A2 in the mapping here: https://www.dell.com/support/manuals/en-us/poweredge-fx2/fx2ownersmanual/io-module-port-mapping%E2%80%94eight-bay-chassis?guid=guid-626ab443-5844-4c0b-abd6-aea3bc7b3e18&lang=en-us

Both of these modules should have all ports cabled to a top-of-rack switch or switches. We need someone to investigate their cabling status: identify whether the cables are present and, if so, which switch(es) they are plugged into.

The next step is to engage tech-square for help. I have never worked with them before and don't know what additional information they will need in order to work on this. I will pursue figuring that out early next week.

aabaris commented 9 months ago

Reviewing this issue I could use some help clarifying/establishing a few things.

1) Trying to follow "I usually just add a label on the issue and then email the techsquare team":
   i) Which label should I use? I haven't found anything related to tech square in the labels available to me.
   ii) I have never engaged tech square for help. What is their contact info? What do I need to say in order for them to route it properly? Who is the customer for this request (UMASS, MOC, BU)?
2) For the sake of clarity, would it make sense to create a new issue with just the information from my second-to-last comment, in order to make the request easier to read?

I am tagging @joachimweyl and @hakasapl here, as those most likely to have the right answers, but please let me know if I need to involve anyone else (for example, does @waygil need to be involved to establish BU as the tech square customer?).

Thank you!

joachimweyl commented 9 months ago

I don't think we need a new issue but if you want to move the pertinent information to the description at the top that would make things easier to read. @hakasapl can provide details about TechSquared.

aabaris commented 9 months ago

> I don't think we need a new issue but if you want to move the pertinent information to the description at the top that would make things easier to read. @hakasapl can provide details about TechSquared.

I tried moving the info to the top, but now what I said looks like it was posted by Lars, who was the original creator of this issue (sorry).

hakasapl commented 9 months ago

@aabaris I'll engage with Techsquare and link them this issue. They will reply here.

imstof commented 9 months ago

We will take a look at this today, most likely in the afternoon.

yani4 commented 9 months ago

It seems that the nodes in chassis R1-PC-C20 U27 lack cables. There is a single Ethernet cable in Gb1 on the chassis, but that's it.

hakasapl commented 9 months ago

@aabaris Harvard networking had this up initially - can we check with Nick for what cables they used (I believe they are 4x10G breakouts) and figure out what switchports we can use at the top of rack? Hopefully some of those cables are still around at MGHPCC

aabaris commented 9 months ago

Thank you for the update.

I am tagging and asking @jtriley to look at this, as I'm mostly a helpless in-between here.

Based on what I can tell remotely from the server side, a total of 16 SFP+ 10Gb/s TwinAx cables would be needed to cable all 8 nodes in that chassis. Justin, could we ask Nick to confirm the types of cables used and the availability of such cables? If Nick does not have documented data, it could perhaps be derived by tracking a couple of MAC addresses in the R1-PC-C20 switches. MAC addresses 50:9a:4c:91:97:09 (wrk-5) and 10:7d:1a:9c:7a:af (wrk-9) should be representative of how working nodes are connected in that rack. Thank you! - A
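
As a rough sanity check on the count above, here is a minimal sketch of the arithmetic. It assumes two 10G ports per node (which gives the 16 links across 8 nodes) and the 4x10G breakout cables mentioned earlier in the thread. The function name and parameters are hypothetical, for illustration only.

```python
import math


def breakout_cables_needed(nodes: int, ports_per_node: int,
                           lanes_per_cable: int = 4) -> tuple[int, int]:
    """Return (10G links required, breakout cables required).

    ASSUMPTION: every node port needs its own 10G link, and each
    breakout cable fans one switch port out into `lanes_per_cable` links.
    """
    links = nodes * ports_per_node
    cables = math.ceil(links / lanes_per_cable)
    return links, cables


# 8 nodes with 2 ports each, cabled with 4x10G breakouts.
print(breakout_cables_needed(nodes=8, ports_per_node=2))  # (16, 4)
```

Under these assumptions, 16 SFP+ ends correspond to 4 breakout cables, which would occupy 4 switch ports in total.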

aabaris commented 9 months ago

We have identified the cables and will pursue the purchase of 4 of them: https://www.fs.com/products/48867.html

While communicating with Nick, a concern was raised that the chassis may be missing pass-through modules. It appears to me, via the remote management interface, that the modules are present.

Could I ask someone to take a picture of the back of that chassis, as well as one of the whole rack? In case I'm being blindsided by something obvious, it would help to see what pieces are there.

Thank you for the help. - Augustin

yani4 commented 9 months ago

Sure, here you go.

[Attached photos: IMG_20231201_171546326, IMG_20231201_171536315]

aabaris commented 9 months ago

> Sure, here you go.

Thank you very much!

waygil commented 9 months ago

> We have identified cables and will pursue purchase of 4 https://www.fs.com/products/48867.html
>
> Communicating with Nick, a concern was raised that the chassis may be missing pass-through modules. It appears to me via remote mgmt interface that the modules are present.
>
> Could I ask someone to take a picture of the back of that chassis as well as one of the whole rack. Just in case I'm blindsided by something obvious, it would help to have some view of what pieces are there.
>
> Thank you for the help. - Augustin

@msdisme can you assist with the purchase of these cables?

msdisme commented 9 months ago

We can, slight delay as Tara is out of town - I will work with hariri on this. @hakasapl do we have any of these cables already? https://www.fs.com/products/48867.html @waygil - confirming cisco?

waygil commented 9 months ago

There is a pallet of cables that we have from Lenovo, but we haven't seen a manifest listing the types or quantities of the cables.

joachimweyl commented 9 months ago

@syockel were the cables from Lenovo "Compatible 40G QSFP+ to 4 x 10G SFP+ Passive Direct Attach Copper Breakout Cable" or just Cat6?

syockel commented 9 months ago

I don’t know. I was hoping that Hakan could open the boxes and let us know. I thought that they were just Cat6. I’ll ask Lenovo if they have BOM for those boxes.

msdisme commented 9 months ago

@hakasapl - don't worry about looking - for now. Stephen Brown is offering to help order them. We will order 4 to data center attention Hakan for now.

hakasapl commented 9 months ago

Okay, I think I misunderstood that these cables were actually there (mixed up with Weca cables). Either way I should be able to check tomorrow

msdisme commented 9 months ago

@hakasapl - don't worry about looking - for now. Stephen Brown is offering to help order them. We will order 4 to data center attention Hakan and Augustin for now.

larsks commented 9 months ago

@hakasapl @aabaris do we expect these hosts to be working after the recent switch maintenance?

aabaris commented 9 months ago

> @hakasapl @aabaris do we expect these hosts to be working after the recent switch maintenance?

@larsks I do not expect these nodes to be fixed until cables are purchased and installed. The switch maintenance was to replace the failed cooling fans on the switch, but wrk-83 and wrk-84 (plus 6 others) are offline due to missing network cables, which are possibly/hopefully in the process of being purchased. The photos posted in this thread show what I believe is the reason these nodes are offline.

hakasapl commented 9 months ago

@larsks Cables have arrived and I've installed them. Ports 21 and 22 on each switch (top switch is port 1 of each node, bottom switch is port 2 of each node), which matches existing config. It looks like the links are already lit up. Let me know if that works. Otherwise we probably need to check with Nick/Christian to update the switch config.
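
The cabling described above can be written out as a checklist of expected links, sketched below. The switch-to-NIC-port mapping and the switch port numbers come from the comment. The order in which node slots land on breakout lanes is an assumption made purely for illustration and would need to be verified on site, and all names are hypothetical.

```python
# Which switch carries which NIC port of each node, per the comment above.
NIC_PORT_BY_SWITCH = {"top": 1, "bottom": 2}
SWITCH_PORTS = (21, 22)   # breakout ports used on each switch
LANES_PER_BREAKOUT = 4    # one 40G QSFP+ fans out to 4 x 10G SFP+


def expected_links(node_slots=range(1, 9)):
    """Return (slot, nic_port, switch, switch_port, lane) tuples.

    ASSUMPTION: slots 1-4 land on port 21 and slots 5-8 on port 22,
    in lane order. The real lane-to-slot order is not stated in the
    thread and must be checked against the physical cabling.
    """
    links = []
    for switch, nic_port in NIC_PORT_BY_SWITCH.items():
        for index, slot in enumerate(node_slots):
            switch_port = SWITCH_PORTS[index // LANES_PER_BREAKOUT]
            lane = index % LANES_PER_BREAKOUT + 1
            links.append((slot, nic_port, switch, switch_port, lane))
    return links


for link in expected_links():
    print(link)
```

A table like this gives 16 rows, one per node port, which can be compared against the link state that the switches and the nodes report.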

joachimweyl commented 8 months ago

@hakasapl sounds like this can be closed.

hakasapl commented 8 months ago

@larsks are the networking issues resolved for you on these nodes?

larsks commented 8 months ago

Yes, these issues appear to be resolved.