paregupt / ucs_traffic_monitor

Cisco UCS traffic monitoring using Grafana, InfluxDB and Telegraf
MIT License

Stats pull taking longer than timeout #8

Open paregupt opened 4 years ago

paregupt commented 4 years ago

The telegraf timer is set to 50 seconds (by default). Sometimes the collector takes longer than that, resulting in the telegraf process killing ucs_traffic_monitor.py. This is not a graceful cleanup. The next invocation may open a new session, soon resulting in the error:

UcsException : [ErrorCode]: 572[ErrorDescription]: User reached maximum session limit

Why does it happen? It depends on the UCS domain. If the system is busy, it may take longer to respond. Monitor the login time if using remote authentication. Sometimes the remote login server (LDAP, AAA, etc.) may introduce multiple seconds of delay.

Approaches to handle this:

  1. Simple approach: Monitor the logs. If this happens, create multiple telegraf processes and spread the UCS domains across multiple ucs_domains.txt files.
  2. Enhancement: Monitor this condition and self-terminate gracefully from ucs_traffic_monitor.py itself before telegraf kills it. Requires a timer thread to monitor the timeout and clean up if the process does not finish before the timeout (see the sketch after this list).
  3. Advanced resolution: Automatically spawn multiple telegraf processes. May be super-cool but requires effort. Any takers?
  4. Advanced resolution: UCS keeps metrics for 5 polling intervals. Instead of polling every 60 seconds (current design), poll every 5 minutes for the last 5 records, then update the database with the data points every 60 seconds. Pros: lower polling frequency on UCS and a potential resolution of #8. Cons: more complicated receiver design and more data pulled at every poll.
  5. Suggestions?
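
A minimal sketch of the timer-thread idea from approach 2. The handle registry and cleanup function below are hypothetical, illustrative names rather than the tool's actual internals; `UcsHandle.logout()` is the Cisco UCS Python SDK call for closing a session:

```python
import os
import threading

SELF_TIMEOUT = 45        # seconds; a few below telegraf's 50s default timeout
open_ucs_handles = []    # hypothetical registry of live UcsHandle sessions

def cleanup_and_exit():
    """Log out of every open UCS session, then exit before telegraf kills us."""
    for handle in open_ucs_handles:
        try:
            handle.logout()          # UcsHandle.logout() from the ucsmsdk
        except Exception:
            pass                     # best effort: never block the exit path
    os._exit(1)                      # hard exit; poll threads may be mid-flight

watchdog = threading.Timer(SELF_TIMEOUT, cleanup_and_exit)
watchdog.daemon = True               # never keep the process alive on its own
watchdog.start()

# ... normal multithreaded polling work happens here ...

watchdog.cancel()                    # finished in time: disarm the watchdog
```

The key point is that the sessions get logged out before exit, so the next invocation does not pile up toward the maximum-session limit.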

Notes:

  1. The Netmiko ConnectHandler timeout is invoked only after the session is up. Need an approach that accounts for the overall timeout, including connection setup time and CLI execution time (see the sketch after this list).
  2. For SDK connections, account for both the login and the query timeout.
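
One way to enforce a single wall-clock deadline over both connection setup and command execution is to run the whole poll in a worker thread and bound the wait on its result. This is a sketch, not the tool's actual implementation; the device parameters and the 45-second deadline are placeholder assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from netmiko import ConnectHandler

def poll_fi(device_params, command):
    """Connect and run one command; setup time counts toward the deadline."""
    conn = ConnectHandler(**device_params)
    try:
        return conn.send_command(command)
    finally:
        conn.disconnect()

# Hypothetical Netmiko device dict for a fabric interconnect
device_params = {"device_type": "cisco_nxos", "host": "fi-a.example.com",
                 "username": "admin", "password": "secret"}

pool = ThreadPoolExecutor(max_workers=1)
future = pool.submit(poll_fi, device_params, "show interface counters")
try:
    output = future.result(timeout=45)   # one deadline for setup + execution
except TimeoutError:
    output = None                        # give up on this domain for this cycle
finally:
    pool.shutdown(wait=False)            # don't block on a hung worker thread
```

The same pattern would cover the SDK case: submit the login-plus-query sequence as one callable and time out on the combined result.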
thomsonac commented 4 years ago

We have a relatively large number of UCS instances, and the instances themselves are on the large side. This tool looks fantastic but is unfortunately useless for us at the moment: it creates so many sessions that 1) the data seems to simply time out on the pull, and 2) the tool eventually locks itself out of the UCSes because there are so many open sessions.


paregupt commented 4 years ago

Can you please run the tool manually with one UCS domain at a time? The log file shows the duration it takes to finish the poll. Please share the output.

One of the workarounds should work. Please do not forget to use the latest version of ucs_traffic_monitor.py.

thomsonac commented 4 years ago

I was able to mini-reverse-engineer it and got the setup working. I broke out two datacenters' worth of UCSes (about 25 instances) into 8 different files (based on environment) and then added each of those groupings to the telegraf.conf file. It also took me a bit to realize I had to manually run the script with the verification switches and change the ownership of both the txt files containing the UCSM hostnames and the log output.

paregupt commented 4 years ago

Good to know.

A few more thoughts:

If this does not work, you would probably have to assign 1 UCS domain per input file and run each as a separate process in Telegraf. Worst case, you will have as many telegraf processes and input files as the number of UCS domains. This is a one-time step only.
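
As an illustration of that split, a small helper could generate one input file per domain from a master list. The file layout and paths below are assumptions for the sketch, not something the repo ships:

```python
from pathlib import Path

# Hypothetical master list: one UCS domain entry per line
master = Path("/usr/local/telegraf/ucs_domains.txt")
out_dir = Path("/usr/local/telegraf")

for i, line in enumerate(master.read_text().splitlines(), start=1):
    if not line.strip() or line.startswith("#"):
        continue                                   # skip blanks and comments
    single = out_dir / f"ucs_domains_group_{i}.txt"
    single.write_text(line + "\n")                 # one domain per input file
    # each file then gets its own [[inputs.exec]] stanza in telegraf.conf
    print(f"created {single}")
```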

Unfortunately, there is no other way to know the response time of a UCS domain until you run the tool. Based on many factors, scale being one, different domains take different amounts of time to respond.

In some environments, I have seen 2 domains in their own telegraf processes (1 per input file) take 20 seconds each, but when the same 2 domains are put in a single input file, it collectively takes 100 seconds. This should not happen, because the tool is multithreaded and accesses the domains in parallel. I do not have a conclusive explanation, but it may be due to base OS limits or Python's multithreading implementation.

Currently, stats are pulled every 60 seconds, so we want the polling to finish in under that. If not, the polling can be done every 120 seconds. This is a last resort, but having data at 2-minute granularity is better than no data.

thomsonac commented 4 years ago

I adjusted the data pull time to 5 minutes, and the largest grouping (7 UCS domains) in the remote datacenter takes around 160 seconds. To be honest, while "more datapoints" is sometimes helpful for finding those weird spikes, 5-minute data provides more than enough datapoints to find the bottlenecks.

While I have you: I tried to find whether there is a built-in retention policy for InfluxDB but was unable to find anything. Will old data be purged automatically? Obviously, as we scale up (though at 1/5 of the data pulls), the database is going to grow at a good clip.

paregupt commented 4 years ago

If you want, please try 1 UCS domain per input file and run each as a separate process in Telegraf. This provides the best accuracy because the data-rate calculation uses 60 seconds by default. You can adjust these variables, but that adds complexity in the long run.
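
For context on why the interval matters: a data rate is derived from the delta between successive counter readings divided by the time between them, so if polls actually arrive slower than the assumed 60 seconds, the computed rate skews. An illustrative sketch, not the tool's code:

```python
POLL_INTERVAL = 60  # seconds; must match the real spacing between polls

def data_rate_bps(prev_bytes, curr_bytes, interval=POLL_INTERVAL):
    """Rate in bits/s from two successive byte-counter readings."""
    delta = curr_bytes - prev_bytes
    if delta < 0:
        return 0.0            # counter wrapped or port reset; drop this sample
    return delta * 8 / interval

# Polling every 120s while still dividing by 60 would double the true rate.
print(data_rate_bps(1_000_000, 4_000_000))  # 400000.0 bits/s over 60s
```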

I have a plan for a data retention and roll-up policy and intend to write up the details in the next 1-2 weeks. Curious: how many domains and servers do you have? Based on other deployments, I can give you an estimate of the per-day storage requirement.
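
Until that write-up lands, InfluxDB 1.x retention policies can cap database growth on their own. A sketch using the influxdb Python client; the host, database name, and duration are placeholders, not this project's settings:

```python
from influxdb import InfluxDBClient

# Placeholder connection details and database name
client = InfluxDBClient(host="localhost", port=8086, database="telegraf")

# Keep raw points for 52 weeks on a single replica, as the default policy
client.create_retention_policy(
    name="one_year",
    duration="52w",
    replication="1",
    database="telegraf",
    default=True,
)
```

Points older than the policy's duration are then purged automatically, which addresses the auto-purge question above.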

thomsonac commented 4 years ago

I think we're at around 27 domains (and not purchasing any more). They tend to be larger; anywhere between 10 and 16 chassis each.

paregupt commented 4 years ago

Roughly 500 MB/day. With 200 GB, you should be able to retain data for a year (500 MB/day × 365 days ≈ 183 GB). If you have additional thoughts on retention and roll-up policy, you are welcome to raise another issue. There may not be a single model that fits all, but I am looking for an approach that works for most.

Let's keep this issue focused, and please keep me posted if you happen to try one domain per input file/telegraf process. It's a one-time change and will give the true picture.

neddyuk commented 3 years ago

I added some more lines to telegraf.conf because I have 12 domains and they time out all the time, since it takes a long time to process them, so I wanted to split them up into multiple txt files. I added the config below, created the relevant txt file, and gave the process full permissions, but it doesn't seem to run. I am new to this and still learning, so any help would be great.

The other question: if you remove a domain, how do you flush it so it no longer appears on the website?

[[inputs.exec]]
   interval = "60s"
   commands = [
       "python3 /usr/local/telegraf/ucs_traffic_monitor.py /usr/local/telegraf/ucs_domains_group_3.txt influxdb-lp -vv",
   ]
   timeout = "60s"
   data_format = "influx"

paregupt commented 3 years ago

The telegraf.conf config looks good. It probably has something to do with the permissions of the input files. Please check and ensure that the permissions of the new files are the same as the permissions of the existing files. Overall, the files must be owned by the telegraf user and group wheel.

Also, ensure the log files in /var/log/telegraf/ucs* have the same permissions.

If it still doesn't work, please reach out.

neddyuk commented 3 years ago

Thanks. Here is a summary of what I have done:

Is there anywhere else I can see the process logs?

paregupt commented 3 years ago

That looks good. Minor point: you don't have to manually create the log file; it will be created automatically. Did you restart the telegraf process (systemctl restart telegraf)? If nothing works, try running it manually:

sudo python3 ucs_traffic_monitor.py ucs_domains_group_3.txt dict -vvv

If you want, consider starting a new thread; we don't want to spam @thomsonac.

neddyuk commented 3 years ago

Thanks, restarting the process seems to have kicked it into life. Thanks for your help! Great product, btw.