paregupt opened 4 years ago
We have a relatively large number of UCS instances, and the instances themselves are on the large side. This tool looks fantastic but is unfortunately unusable for us right now: the tool creates so many sessions that 1) the data pull seems to simply time out and 2) the tool eventually locks itself out of the UCSes because there are so many open sessions.
Can you please run the tool manually with one UCS domain at a time? The log file shows how long the poll takes to finish. Please share the output.
One of the workarounds should work. Please do not forget to use the latest version of ucs_traffic_monitor.py.
I was able to mini-reverse engineer it and got the setup working. I broke out two datacenters' worth of UCSes (about 25 instances) into 8 different files (based on environment) and then added each of those groupings to the telegraf.conf file. It also took me a bit to realize I had to manually run the script with the verification switches and change the ownership of both the txt files containing the UCSM hostnames AND the log output.
Good to know.
A few more thoughts:
If this does not work, you will probably have to assign one UCS domain per input file and run each as a separate process in Telegraf. Worst case, you will have as many Telegraf processes and input files as you have UCS domains. This is a one-time step only.
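One-domain-per-input-file can be sketched as separate `[[inputs.exec]]` stanzas in telegraf.conf, each pointing at its own txt file. The file names below are illustrative; the invocation itself matches the one used elsewhere in this thread.

```toml
# Sketch: one UCS domain per input file, one exec stanza each.
# File names are illustrative; adjust to your install.
[[inputs.exec]]
  interval = "60s"
  commands = [
    "python3 /usr/local/telegraf/ucs_traffic_monitor.py /usr/local/telegraf/ucs_domain_1.txt influxdb-lp -vv",
  ]
  timeout = "60s"
  data_format = "influx"

[[inputs.exec]]
  interval = "60s"
  commands = [
    "python3 /usr/local/telegraf/ucs_traffic_monitor.py /usr/local/telegraf/ucs_domain_2.txt influxdb-lp -vv",
  ]
  timeout = "60s"
  data_format = "influx"
```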
Unfortunately, there is no other way to know the response time of a UCS domain until you run the tool. Depending on many factors, scale being one, different domains take different amounts of time to respond.
In some environments, I have seen that 2 domains in their own Telegraf processes (1 per input file) take 20 seconds each. But when the same 2 domains are put in a single input file, collectively it takes 100 seconds. This should not happen because the tool is multithreaded and accesses the domains in parallel. I do not have a conclusive explanation, but it may be due to base OS limits or Python's multithreading implementation.
Currently, stats are pulled every 60 seconds, so we want the polling to finish under that. If not, polling can be done every 120 seconds. This is a last resort, but data at 2-minute granularity is better than no data.
I adjusted the data-pull interval to 5 minutes, and the largest grouping (7 UCS domains, in the remote datacenter) takes around 160 seconds. To be honest, while "more datapoints" are sometimes helpful for finding those weird spikes, 5-minute data provides more than enough datapoints to find the bottlenecks.
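For reference, the 5-minute pull corresponds to a telegraf.conf stanza along these lines (the file name is illustrative; note that the timeout is raised to match the new interval):

```toml
# Sketch of the 5-minute pull; file name is illustrative.
[[inputs.exec]]
  interval = "300s"
  commands = [
    "python3 /usr/local/telegraf/ucs_traffic_monitor.py /usr/local/telegraf/ucs_domains_remote.txt influxdb-lp -vv",
  ]
  timeout = "300s"
  data_format = "influx"
```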
While I have you: I tried to find whether there was a built-in retention policy for InfluxDB but was unable to find anything. Will old data be purged automatically? Obviously, as we scale up (though with 1/5 of the data pulls), the database is going to grow at a good clip.
If you want, please try one UCS domain per input file, each running as a separate process in Telegraf. This gives the best accuracy because the data-rate calculation uses 60 seconds by default. You can adjust these variables, but that adds complexity in the long run.
I have a plan for a data retention and roll-up policy and intend to write up the details in the next 1-2 weeks. Curious: how many domains and servers do you have? Based on other deployments, I can give you an estimate of the per-day storage requirement.
I think we're at around 27 domains (and not purchasing any more). They tend to be on the larger side: anywhere between 10 and 16 chassis each.
Roughly 500 MB/day. With 200 GB, you should be able to retain data for a year. If you have additional thoughts on retention and roll-up policy, you are welcome to raise another issue. There may not be a single model that fits all, but I am looking for an approach that works for most.
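On the earlier retention question: InfluxDB 1.x does support retention policies, which age out old data automatically. A hedged sketch, assuming the data lands in a database named "telegraf" (the actual database name in your setup may differ):

```sql
-- InfluxDB 1.x sketch; the "telegraf" database name is an assumption.
-- Keep data for 365 days, after which InfluxDB drops it automatically.
CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 365d REPLICATION 1 DEFAULT
```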
Let’s keep this issue focused, and please keep me posted if you happen to try one domain per input file/Telegraf process. It’s a one-time change and will give the true picture.
I added some more lines to the telegraf.conf file. I have 12 domains, and they time out all the time because they take a long time to process, so I wanted to split them into multiple txt files. I added the config below, created the relevant txt file, and gave the process full permissions, but it doesn't seem to run. I am new to this and still learning, so any help would be great.
The other question: if you remove a domain, how do you flush it so it doesn't appear on the website any longer?
```toml
[[inputs.exec]]
  interval = "60s"
  commands = [
    "python3 /usr/local/telegraf/ucs_traffic_monitor.py /usr/local/telegraf/ucs_domains_group_3.txt influxdb-lp -vv",
  ]
  timeout = "60s"
  data_format = "influx"
```
The telegraf.conf config looks good. It probably has something to do with the permissions of the input files. Please check that the permissions of the new files match those of the existing files. Overall, the files must be owned by the telegraf user and the wheel group.
Also, ensure the log files in /var/log/telegraf/ucs* have the same permissions.
If it still doesn't work, please reach out.
Thanks. Here is a summary of what I have done:
Is there anywhere else I can see the process logs?
That looks good. Minor point: you don't have to manually create the log file. It will be automatically created.
Did you restart the telegraf process? `systemctl restart telegraf`
If nothing works, try running it manually.
```shell
sudo python3 ucs_traffic_monitor.py ucs_domains_group_3.txt dict -vvv
```
If you want, consider starting a new thread. We don't want to spam @thomsonac
Thanks, restarting the process seemed to kick it into life. Thanks for your help! Great product, btw.
The telegraf timer is set to 50 seconds (by default). Sometimes the collector takes longer than that, and the telegraf process kills ucs_traffic_monitor.py. This is not a graceful cleanup. The next invocation may open a new session, soon resulting in the error:
```
UcsException : [ErrorCode]: 572[ErrorDescription]: User reached maximum session limit
```
Why does it happen? It depends on the UCS domain. If the system is busy, it may take longer to respond. Monitor the login time if using remote authentication; sometimes the remote login server (LDAP, AAA, etc.) may introduce multiple seconds of delay.
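One way a collector can make this cleanup graceful (a hypothetical sketch, not the tool's actual code) is to catch the SIGTERM that Telegraf sends on timeout and log out of any open UCS sessions before exiting, so the killed run does not leave a session counting toward the limit:

```python
import signal
import sys

# Hypothetical sketch, not ucs_traffic_monitor.py's actual code: track open
# UCS handles so the SIGTERM sent by telegraf on timeout still logs out
# cleanly instead of leaving a stale session on the fabric interconnect.
open_handles = []

def logout_all(signum=None, frame=None):
    """Best-effort logout of every tracked UCS handle."""
    for handle in open_handles:
        try:
            handle.logout()
        except Exception:
            pass  # best effort; the session will eventually age out
    if signum is not None:
        sys.exit(0)  # exit only when invoked as a signal handler

# telegraf sends SIGTERM when the collector exceeds its configured timeout
signal.signal(signal.SIGTERM, logout_all)
```

Each handle appended to `open_handles` after login would then be released even on a forced kill (short of SIGKILL, which cannot be caught).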
Approaches to handle this:
Notes: