paregupt / ucs_traffic_monitor

Cisco UCS traffic monitoring using Grafana, InfluxDB and Telegraf
MIT License

Any chance backplane RX/TX pause counters are messed up? #41

Closed andrico21 closed 4 years ago

andrico21 commented 4 years ago

I'm just trying to build a picture of my infrastructure issues and got stuck. The only thing I need to clarify is below: maybe there is an error either in the graph explanation (or even in the UCS counter description) for "Backplane port RX PAUSE stats" and "Backplane port TX PAUSE stats"...

According to the graph descriptions, my blades are able to receive more traffic (almost no RX pauses on the chassis backplane), yet both IOMs are just screaming at the servers with pauses like this... (screenshot: _TX_Pauses_Chassis_Backplane)

At the same time, both IOMs are telling the FIs to stop ingress (RX pauses on the FI ports), but I see no reason for them to do that...

The confusing part is that my servers really are overloaded with traffic processing (very heavy CPU load and no jumbo frames), so more likely they are the ones screaming at the IOMs to stop... If the counters really are swapped, the puzzle is solved, but right now it looks pretty weird: the IOMs are sending a lot of pauses in both directions while none of the interfaces is utilized above 35%...

andrico21 commented 4 years ago

(screenshot: _TX_Pauses_Chassis_Backplane2) - commented on graphs

eminchen commented 4 years ago

It may be worth opening a TAC case to check the logs and QoS setup to make sure you are not hitting any known bugs.

Regards, Eugene

andrico21 commented 4 years ago

A TAC case is on its way; I'm just trying to check UTM first - is it picking the right counters or not? :)

paregupt commented 4 years ago

@andrico21 To verify the correctness of the graphs, please pick an endpoint, for example Chassis-3, backplane port 2/13, and cross-verify it with the output of the show interface priority-flow-control command on the FI. Make sure to look at the correct FI - A or B. You won't be able to go back in time using the show command, but if the PAUSE activity is ongoing, run the command twice, 60 or 120 seconds apart, take the difference, and compare it to the graphs.
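For example, a minimal sketch of that delta check could look like the following (my rough illustration, not part of UTM; the column layout of the show interface priority-flow-control output is an assumption and may need adjusting for your NX-OS release). Save the CLI output to a file twice, some seconds apart, then diff the per-interface RxPPP/TxPPP counters:

```python
#!/usr/bin/env python3
# Rough sketch: diff RxPPP/TxPPP counters between two saved snapshots of
# 'show interface priority-flow-control' output taken ~60-120 s apart.
# Assumed row layout: Interface  Admin  Oper  RxPPP  TxPPP (adjust if needed).
import sys

def parse_pfc(path):
    """Return {interface: (RxPPP, TxPPP)} for rows that look like counter lines."""
    counters = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if (len(parts) >= 5 and parts[0].startswith("Eth")
                    and parts[-2].isdigit() and parts[-1].isdigit()):
                counters[parts[0]] = (int(parts[-2]), int(parts[-1]))
    return counters

def main(before_file, after_file):
    before, after = parse_pfc(before_file), parse_pfc(after_file)
    print(f"{'Interface':<20}{'dRxPPP':>12}{'dTxPPP':>12}")
    for intf in sorted(before.keys() & after.keys()):
        d_rx = after[intf][0] - before[intf][0]
        d_tx = after[intf][1] - before[intf][1]
        if d_rx or d_tx:  # only show ports whose counters moved
            print(f"{intf:<20}{d_rx:>12}{d_tx:>12}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Usage would be something like python3 pfc_delta.py before.txt after.txt; the per-window deltas should roughly match the rates shown on the UTM backplane PAUSE panels.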

After the verification, we can dig further into the root cause. Pause frames are used for storage traffic only. Also, linking pause frames to link utilization is not conclusive. I have some write-up in the graph panel descriptions, and more details are spread across the dashboards, but before that, let's verify the accuracy.

andrico21 commented 4 years ago

Checked via CLI; TxPPP is the counter that is increasing insanely (output attached): PFC_Output.txt

andrico21 commented 4 years ago

@paregupt The point about the servers' high CPU load is that their CPUs are quite overloaded processing 5 Gbps of VM traffic with VMQ disabled (just waiting for a new driver release in the next 1-2 months) and the default 1.5K MTU... So the servers have every reason to throw pauses, and the IOMs most likely don't... but the counters show the inverted condition - that's why I'm confused :)

P.S. A TAC case is also in progress, but my past experience with TAC and pauses could have been better... I've been experiencing this issue for 2-3 years and there were TAC cases before; now it's just backed by UTM as a visualization solution :)

paregupt commented 4 years ago

Just to be clear, are the UTM graphs showing the same data as the CLI? Is this verified now? If yes, can you please explain the question again? Sorry, I may have missed it in the details above. Finally, please continue working with TAC. Nothing is official about UTM or my responses.

andrico21 commented 4 years ago

To be clear: the current issue is about TxPPP and RxPPP parsing and representation in Grafana. I can confirm that I see a lot of TxPPP and almost no RxPPP on the Ethernet3/1/xx ports, and on the UTM dashboards the picture is quite the same - so even if there is an issue with the counters, I doubt it's a UTM bug...

paregupt commented 4 years ago

Ah, OK. Thanks for the clarification. You have closed the issue but, if you want, we can still discuss why you have RX PAUSE on the backplane ports and what can be done about it. A few things you may want to consider:

  1. PAUSE frames have granularity on a nanosecond scale, while the BW utilization is a 60-second average. It is possible that for a few microseconds there was a large READ or WRITE IO which triggered the PAUSE. This so-called microburst may not show up as high BW utilization (see the rough sketch after this list). Please read Are you calculating bandwidth requirement from link-utilization? Think again.
  2. I have discussed large-size READ and WRITE IO in depth at Cisco Live. Please refer to Designing Storage Networks for the next decade in an All Flash Data Center - BRKDCN-2010.
  3. A large READ causes RX traffic on the servers, resulting in RX PAUSE on the backplane ports. Similarly, a large WRITE causes TX traffic on the servers, resulting in TX PAUSE on the backplane ports.
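
To put rough numbers on point 1 (the figures below are made up purely for illustration), a short burst at line rate can trigger PAUSE while barely registering in a 60-second average:

```python
# Back-of-the-envelope illustration for point 1 above (all numbers are made up):
# a microburst at line rate can fill a port buffer and trigger PAUSE while
# contributing almost nothing to a 60-second average utilization.
LINK_GBPS = 10        # assumed backplane port speed
BURST_MS = 5          # hypothetical microburst duration
AVG_WINDOW_S = 60     # averaging window used by the utilization graphs

burst_bytes = LINK_GBPS * 1e9 / 8 * (BURST_MS / 1000)            # bytes moved in the burst
avg_util_pct = (burst_bytes * 8) / (LINK_GBPS * 1e9 * AVG_WINDOW_S) * 100

print(f"Burst moves {burst_bytes / 1e6:.2f} MB at line rate")     # ~6.25 MB
print(f"Contribution to the 60 s average: {avg_util_pct:.4f}%")   # ~0.0083%
```

So a burst big enough to fill an IOM buffer and generate PAUSE can be effectively invisible in a 60-second utilization average, which is why seeing heavy PAUSE at 35% utilization is not a contradiction.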

I read your notes above again and this got my attention: "So the servers have every reason to throw pauses, and the IOMs most likely don't... but the counters show the inverted condition - that's why I'm confused :)"

It is a complex topic and requires many puzzle pieces to be put together. There is a ton of information on Ciscolive.com. Hope this helps. Please share your findings with us.

andrico21 commented 4 years ago

@paregupt I'll reply to you tomorrow with some graphs; there's too much to explain today. I'll be glad to discuss this case unofficially - if you're ready to spend some time. Thanks for the links - I'll watch them today.

andrico21 commented 4 years ago

@paregupt First of all, the initial problem is storage working unstably in terms of performance. The last troubleshooting session showed us that some storage IO requests (writes) were completed in 1.3 seconds.

The AFF A300 is connected via 2x4x8G to a 62xx FI cluster. There are no significant latencies on the storage itself (controllers/LIFs/volumes or drive groups) and no significant pause values on the FI ports the storage is connected to. The only pauses we see are RX on the FI ports the IOMs are connected to (and this pause graph looks pretty much the same as the VM disk queue and latency graphs - the same spikes, a very similar shape) and a huge amount (thousands) of TX pauses on the backplane ports in the chassis. The IOMs are 2208s with 4x10G uplinks each, but I see the same symptoms on another 63xx UCS domain of mine with 2304 IOMs at 2x40G each - different storage, but also blades quite oversubscribed in terms of CPU. Anyway, the storage uplink ports on the FIs are not showing any significant pauses, while the server ports on the FIs show RX pauses in the hundreds.

Regarding patterns: we experience such issues all the time (just unexplained storage slowdowns from time to time, and I haven't been able to find a noisy neighbour yet - in terms of VM storage IO at least). The most critical symptoms we observe at night, but not during backup (which is also the first suspect checked in any case of storage slowdown on our side).

I'm continuing to dig into this case, still have to watch your presentation from Cisco Live (nice speaker, though! :)), and will also try to raise it with my network guys. I had a TAC case about this a year ago and it wasn't successful; now I'm running another one with no positive results yet :)

And a little disclaimer about my background - I'm not any kind of networking guy, but a virtualization platform guy, so please excuse any potentially silly questions with maybe obvious answers :)

andrico21 commented 4 years ago

After some additional investigation I agree the counters are correct, but now there's another point - why aren't the server-generated pauses visible anywhere? :) But anyway, as mentioned before, that's a subject to investigate in the existing TAC case.

paregupt commented 4 years ago

Can you please share the case number?

bsauvajon commented 1 year ago

@andrico21 Did TAC find a solution to your issue? I'm running into a very similar problem on my UCS infrastructure, with an unbelievable amount of pause frames sent by the IOMs causing a lot of trouble on the FCoE network, and I have spent many hours with TAC without any success. Maybe your case could be helpful.

andrico21 commented 1 year ago

@bsauvajon Hi, I had no luck, but I realized the problem is on the IOM uplinks - huge traffic between blades is just killing them; UTM helped a lot in drawing the right conclusions. TBH I just started to buy C-series instead of blades and these problems were gone. P.S. I have since left the company, so I can't say anything specific about the infrastructure, but the M6 C-series were great.

paregupt commented 1 year ago

@bsauvajon if you want, I can take a look. Please send me an email and we can do a webex.

bsauvajon commented 1 year ago

@paregupt Thanks for your offer; I would be very happy if you could help. Where can I find your email?

paregupt commented 1 year ago

Same as my GitHub ID, at cisco.