netscaler / netscaler-adc-metrics-exporter

Export metrics from Citrix ADC (NetScaler) to Prometheus

fix test/login/logout eternal loops and small refactor around session #37

Closed. rafaelpirolla closed this 4 years ago

rafaelpirolla commented 4 years ago

Although contributions from outside Citrix are not being accepted, it may be interesting to check the changes I made.

I removed the initial checks because I don't see the point of checking whether you can connect to the ADC in an eternal loop. I also removed the loop from login and logout, since retrying at the defined 1-second rate can kill the management CPU. This code now tries to authenticate 3 times and exits if it can't. I have created a systemd service file that then restarts the exporter every 5 minutes; that kind of logic should live outside the exporter.
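
Roughly, the bounded login looks like this (a simplified sketch rather than the exact PR code; the Nitro login endpoint and the `sessionid` response field are what I assume the exporter uses):

```python
import sys
import time
import requests

MAX_ATTEMPTS = 3  # give up after three failed logins instead of looping forever

def login(nsip, username, password, timeout=10):
    """Try to obtain a Nitro session token; exit if all attempts fail."""
    url = "http://{}/nitro/v1/config/login".format(nsip)
    payload = {"login": {"username": username, "password": password}}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            r = requests.post(url, json=payload, timeout=timeout)
            r.raise_for_status()
            return r.json()["sessionid"]
        except (requests.RequestException, KeyError, ValueError) as e:
            print("login attempt {}/{} failed: {}".format(attempt, MAX_ATTEMPTS, e))
            time.sleep(5)
    # let the supervisor (systemd unit, docker restart policy) decide when to retry
    sys.exit(1)
```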

For failures while getting stats, it would be great to reply with a 500, or to report that the exporter is down via the up metric. I'm still not sure how to handle partial failures though (e.g. some stats can be fetched, but not all).
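
Something like the following could work for the up-metric idea, assuming prometheus_client is used (the `citrixadc_probe_success` gauge and the `fetch_all_stats` helper are illustrative names, not existing exporter code):

```python
from prometheus_client import Gauge

# 1 = the last collection from the ADC succeeded, 0 = it failed
probe_success = Gauge('citrixadc_probe_success',
                      'Whether the last stats collection from the ADC succeeded')

def collect_stats(session):
    try:
        stats = fetch_all_stats(session)  # illustrative helper that queries the Nitro stat API
        probe_success.set(1)
        return stats
    except Exception:
        # partial failures are still an open question; here any error marks the
        # whole target as down instead of exposing half-collected stats
        probe_success.set(0)
        return {}
```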

Since you only use GET and POST, I overrode these methods to take the timeout from the session declaration, so that we have a clear retry, backoff and timeout policy in place.
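
The override is essentially the usual requests pattern: a Session subclass with a default timeout plus an HTTPAdapter retry policy. A sketch with example values (not the exact values from the PR):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class TimeoutSession(requests.Session):
    """requests.Session with a default timeout and a bounded retry/backoff policy."""

    def __init__(self, timeout=10, retries=3, backoff_factor=1):
        super().__init__()
        self.timeout = timeout
        retry = Retry(total=retries,
                      backoff_factor=backoff_factor,
                      status_forcelist=(500, 502, 503, 504),
                      # urllib3 older than 1.26 calls this parameter method_whitelist
                      allowed_methods=frozenset(["GET", "POST"]))
        adapter = HTTPAdapter(max_retries=retry)
        self.mount("http://", adapter)
        self.mount("https://", adapter)

    def request(self, method, url, **kwargs):
        # apply the session-level timeout unless the caller passes one explicitly
        kwargs.setdefault("timeout", self.timeout)
        return super().request(method, url, **kwargs)
```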

As I am no expert Python dev there may well be idiomatic issues. Bonus: it works with Python 3.

aroraharsh23 commented 4 years ago

Thanks for your suggestions; we will review this and incorporate some of it if it improves performance. We already had a Python 3 version planned for the next release.

rafaelpirolla commented 4 years ago

Performance-wise it is the same; I think a switch from requests to httpx would improve performance. The real issue is that the infinite loop of initial checks and login/logout is really bad for the ADC.

vvoody commented 4 years ago

Thanks @rafaelpirolla ! Exiting after a limited number of retries is a good mechanism to protect the ADC.

We also hit the session leak issue after running the exporter for a week. The exporter kept retrying logout, and "show system session" was full of the exporter's sessions.

@aroraharsh23 , looking forward to your improvement :D

aroraharsh23 commented 4 years ago

@vvoody we are reviewing some of the above changes with long-run tests. Will update the release soon.

aroraharsh23 commented 4 years ago

@vvoody we are somehow unable to reproduce the scenario in which you observed the "session leak issue". Could you share some details about your deployment?

1. Is this an individual exporter container or a k8s pod?
2. What is the Prometheus scraping rate/timeout?
3. Are you scraping all the metrics in metrics.json, or have you specified only particular metrics?
4. A gist of your ADC config that you were scraping.
5. Apart from the exporter, are there other entities frequently accessing the ADC that might lead to a conflict?
6. Does this happen frequently or only during long runs?

These details will help us resolve the issue.

vvoody commented 4 years ago
  1. Is this an individual exporter container or a k8s pod?

Individual containers on a Linux host. I started one exporter container for each ADC, about 70-80 in total.

docker run -dt --restart unless-stopped -p 10001:10001 -v $PWD/config.yaml:/exporter/config.yaml -v $PWD/metrics.json:/exporter/metrics.json --name exporter-adc001 citrix-adc-metrics-exporter:1.4.3 --port=10001 --config-file=/exporter/config.yaml --metrics-file=/exporter/metrics.json --target-nsip=adc001.mycorp.com
docker run -dt --restart unless-stopped -p 10002:10002 -v $PWD/config.yaml:/exporter/config.yaml -v $PWD/metrics.json:/exporter/metrics.json --name exporter-adc002 citrix-adc-metrics-exporter:1.4.3 --port=10002 --config-file=/exporter/config.yaml --metrics-file=/exporter/metrics.json --target-nsip=adc002.mycorp.com
  2. What is the Prometheus scraping rate/timeout?
global:
  scrape_interval: 30s
  scrape_timeout:  10s

  external_labels:
    monitor: 'prometheus-monitor'

scrape_configs:
- job_name: sslsessions

  static_configs:
  # put all exporter of all LB in targets
  - targets: ['host1:10001', 'host1:10002', 'host2:20001', 'host3:30001', ..............]
  3. Are you scraping all the metrics in metrics.json, or have you specified only particular metrics?

metrics.json:

{
    "ssl": {
          "counters": [
              ["ssltotsslv3sessions", "citrixadc_ssl_tot_v3_sessions"],
              ["ssltotsessions", "citrixadc_ssl_tot_sessions"]
          ]
      }
}
  4. A gist of your ADC config that you were scraping.

Just total SSL and SSLv3 sessions on ADC.

  5. Apart from the exporter, are there other entities frequently accessing the ADC that might lead to a conflict?

Yes, we have other automation tools accessing the ADC, but they don't fetch the metrics I care about via the Nitro API.

  6. Does this happen frequently or only during long runs?

It happened after running for a few days. The first and second days were fine, but one day I checked the status of the Prometheus targets and found many were down. Their docker logs were full of exporter logout failures ("Retrying to Logout of citrix adc"). Then I logged in to the ADC, ran "show system session", and about 40+ out of 50 sessions belonged to the exporter.


Some thoughts

The exporter opens a new session to get everything and then must log out before the next scrape. Even if logout sometimes fails, the ADC should only ever hold one session from the exporter at a time.

When the exporter container is restarted for some reason (I use --restart unless-stopped), a new session is created. The old session left by a previous logout failure still exists on the ADC until it expires. Between logout failures and container restarts, sessions accumulate on the ADC.

The default timeout of a Nitro session is 30 minutes. Maybe we could decrease it for each scrape.

Or we could cache the Nitro session token for continuous use, which would also avoid logging out on every scrape, since the exporter scrapes frequently. Of course, it would have to handle token expiration.
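
Both ideas could even be combined: log in with a shorter Nitro session timeout and cache the token, re-logging in only when it is about to expire. A rough sketch (the `timeout` login field and `sessionid` response field are from the Nitro REST API as I understand it, worth double-checking):

```python
import time
import requests

class NitroSession:
    """Caches a Nitro session token and renews it shortly before it expires."""

    def __init__(self, nsip, username, password, session_timeout=300):
        self.base = "http://{}/nitro/v1/config".format(nsip)
        self.username = username
        self.password = password
        self.session_timeout = session_timeout  # seconds; much shorter than the 30-minute default
        self.token = None
        self.expires_at = 0

    def _login(self):
        payload = {"login": {"username": self.username,
                             "password": self.password,
                             "timeout": self.session_timeout}}
        r = requests.post(self.base + "/login", json=payload, timeout=10)
        r.raise_for_status()
        self.token = r.json()["sessionid"]
        # renew a bit early so a scrape never runs with an expired token
        self.expires_at = time.time() + self.session_timeout - 30

    def get_token(self):
        if self.token is None or time.time() >= self.expires_at:
            self._login()
        return self.token
```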

aroraharsh23 commented 4 years ago

Some of the suggestions have been incorporated post 1.4.5. Thanks for the input. Closing this.