netscaler / netscaler-adc-metrics-exporter

Export metrics from Citrix ADC (NetScaler) to Prometheus

up metric #39

Closed · rafaelpirolla closed this issue 3 years ago

rafaelpirolla commented 3 years ago

As per: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes

It's good practice to create the up metric for the exporter like in: https://github.com/prometheus/haproxy_exporter/blob/146b612c9e13960a8c9adf0e98f50a6ad7e96e1f/haproxy_exporter.go#L321

Although it seems we have it on citrix_exporter, it never seems to go to 0.

It would be great to have it set to 0 when the ADC fetch is not successful.
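
For reference, the pattern in the linked haproxy_exporter code would translate to something roughly like this in Python with prometheus_client (the library this exporter already uses for the process_*/python_* metrics). It is only a sketch: the AdcUpCollector class, the citrixadc_up name and the NITRO URL are illustrative, not the exporter's actual code.

import requests

from prometheus_client.core import GaugeMetricFamily


class AdcUpCollector:
    """Illustrative collector: the gauge is 1 when the ADC answered the
    stat request and 0 when it did not."""

    def __init__(self, nsip):
        self.nsip = nsip

    def collect(self):
        up = GaugeMetricFamily('citrixadc_up',
                               'Whether the last ADC stat fetch succeeded',
                               labels=['nsip'])
        try:
            # Hypothetical probe of one NITRO stat endpoint; auth omitted.
            r = requests.get('http://%s/nitro/v1/stat/system' % self.nsip,
                             timeout=5)
            r.raise_for_status()
            up.add_metric([self.nsip], 1.0)
        except requests.RequestException:
            up.add_metric([self.nsip], 0.0)
        yield up

Registered with REGISTRY.register(AdcUpCollector('cpx:9080')), every scrape then returns the gauge, and it drops to 0 whenever the fetch fails.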

aroraharsh23 commented 3 years ago

I am not sure what the requirement is here.
So, when we say scrape failed: is this for a particular metric or the entire scrape? Also, the docs you linked say "although you can't distinguish between the exporter being down and the application being down" - won't that create confusion to begin with?

rafaelpirolla commented 3 years ago

"So, when we say scrape failed: is this for a particular metric or the entire scrape?"

For citrix-exporter as it stands today, it's hard to create an alert that tells you the ADC is down.

aroraharsh23 commented 3 years ago

Thanks. Will explore the feasibility and implications of adding this.

aroraharsh23 commented 3 years ago

Rafael, when you say "Although it seems we have it on citrix_exporter, it never seems to go to 0":

Are you referring to the process stats or to something else? Can you share an example of which stat you mean? I'm assuming you mean the process_* stats should also go to zero - is that understanding correct?

rafaelpirolla commented 3 years ago

I meant the up metric.

rafaelpirolla commented 3 years ago

Say you have this docker-compose.yml:

version: '3.4'

services:
  cpx:
    container_name: cpx
    image: quay.io/citrix/citrix-k8s-cpx-ingress:12.1-51.16
    ports:
      - 9080:9080
      - 8000:8000
    tty: true
    ulimits:
      core: -1
    environment:
      CPX_CORES: 2
      EULA: "yes"
    privileged: true
    networks:
      - monitoring

  citrix_exporter:
    container_name: citrix_exporter
    image: quay.io/citrix/citrix-adc-metrics-exporter:1.4.6
    volumes:
      - ./citrix_exporter:/citrix_exporter
    command: --target-nsip cpx:9080 --port 9090 --config-file /citrix_exporter/config.yaml --secure no --metrics-file /citrix_exporter/metrics.json
    ports:
      - 9090:9090
    networks:
      - monitoring

networks:
  monitoring:

The metrics.json file is just getting one metric:

{
    "system": {
        "gauges": [
            ["cpuusagepcnt", "adc_cpu_usage_percent"]
        ]
    }
}

And config.yaml sets the username and password to nsroot/nsroot, as expected for this CPX version.

Then you can scrape the exporter, which fetches the stats from CPX:

~ ❯❯❯ curl -s localhost:9090/metrics | grep -v "^#"
python_gc_objects_collected_total{generation="0"} 514.0
python_gc_objects_collected_total{generation="1"} 7.0
python_gc_objects_collected_total{generation="2"} 0.0
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
python_gc_collections_total{generation="0"} 60.0
python_gc_collections_total{generation="1"} 5.0
python_gc_collections_total{generation="2"} 0.0
python_info{implementation="CPython",major="3",minor="8",patchlevel="2",version="3.8.2"} 1.0
process_virtual_memory_bytes 2.8082176e+07
process_resident_memory_bytes 2.2921216e+07
process_start_time_seconds 1.60491923329e+09
process_cpu_seconds_total 0.37
process_open_fds 7.0
process_max_fds 1.048576e+06
adc_cpu_usage_percent{nsip="cpx:9080"} 0.0

Now we stop CPX:

~ ❯❯❯ docker stop cpx
cpx

And finally scrape the exporter again:

~ ❯❯❯ curl -s localhost:9090/metrics | grep -v "^#"
python_gc_objects_collected_total{generation="0"} 514.0
python_gc_objects_collected_total{generation="1"} 7.0
python_gc_objects_collected_total{generation="2"} 0.0
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
python_gc_collections_total{generation="0"} 60.0
python_gc_collections_total{generation="1"} 5.0
python_gc_collections_total{generation="2"} 0.0
python_info{implementation="CPython",major="3",minor="8",patchlevel="2",version="3.8.2"} 1.0
process_virtual_memory_bytes 2.8082176e+07
process_resident_memory_bytes 2.3048192e+07
process_start_time_seconds 1.60491923329e+09
process_cpu_seconds_total 0.39999999999999997
process_open_fds 8.0
process_max_fds 1.048576e+06

How can anyone tell from any of these metrics that the exporter is up but the CPX is down?

Added after edit: the absence of adc_cpu_usage_percent shouldn't be the signal, as that adds a lot of unneeded complexity on the PromQL side. blackbox_exporter has the probe_success metric, for example; haproxy_exporter I think uses the up metric. You could create a configurable prefix/name (please don't hardcode citrix_up) for the needed metric.

rafaelpirolla commented 3 years ago

Probably a change to this function? https://github.com/citrix/citrix-adc-metrics-exporter/blob/15f435bad8b3bef42367caf4be5b954a9d7b1415/exporter.py#L535

aroraharsh23 commented 3 years ago

Thanks for the input. We'll assess the impact and make the change if feasible.

aroraharsh23 commented 3 years ago

We can add the up metric or a probe success metric, but that alone won't suffice. The presence of only the counters shown below doesn't necessarily mean the ADC is down while the exporter is up.

You can also see only the metrics below when the scrape interval isn't long enough for the exporter to fetch all the counters; in that case we have no option but to return just these counters for the next incoming requests until the original request is served. The ADC would still be up, and we can't mark it down for rejected requests.

To explain: say the 1st request takes 45 sec for a huge ADC config and the scrape interval is 10 sec. While we process the 1st request, the subsequent 2nd (10th sec), 3rd (20th sec), 4th (30th sec) and 5th (40th sec) requests can't actually be served, and we return only the metrics below until the 1st request completes. The user then needs to tune their scrape interval to 45-50 sec. There is no option to keep queueing the subsequent requests in that case.

So, in my opinion, just setting the up metric won't be enough. Either we also add one more custom metric, say a scrape_time_elapsed yes/no indicator, or we leave the current implementation as it is.

~ ❯❯❯ curl -s localhost:9090/metrics | grep -v "^#"
python_gc_objects_collected_total{generation="0"} 514.0
python_gc_objects_collected_total{generation="1"} 7.0
python_gc_objects_collected_total{generation="2"} 0.0
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
python_gc_collections_total{generation="0"} 60.0
python_gc_collections_total{generation="1"} 5.0
python_gc_collections_total{generation="2"} 0.0
python_info{implementation="CPython",major="3",minor="8",patchlevel="2",version="3.8.2"} 1.0
process_virtual_memory_bytes 2.8082176e+07
process_resident_memory_bytes 2.3048192e+07
process_start_time_seconds 1.60491923329e+09
process_cpu_seconds_total 0.39999999999999997
process_open_fds 8.0
process_max_fds 1.048576e+06
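
One way to keep "previous fetch still in flight" apart from "ADC down" without a full overhaul might be a non-blocking lock around the fetch plus a second gauge that flags skipped scrapes. The sketch below is against assumed names only - fetch_stats, citrixadc_fetch_success and citrixadc_scrape_skipped are made up for illustration and are not part of the exporter.

import threading

from prometheus_client.core import GaugeMetricFamily

_fetch_lock = threading.Lock()


class GuardedCollector:
    """Sketch: tell 'fetch still running' apart from 'ADC unreachable'."""

    def __init__(self, nsip, fetch_stats):
        self.nsip = nsip
        self.fetch_stats = fetch_stats  # assumed callable doing the NITRO fetch

    def collect(self):
        probe = GaugeMetricFamily('citrixadc_fetch_success',
                                  '1 if the ADC answered the stat fetch, 0 otherwise',
                                  labels=['nsip'])
        skipped = GaugeMetricFamily('citrixadc_scrape_skipped',
                                    '1 if this scrape was skipped because a fetch was still in flight',
                                    labels=['nsip'])
        if not _fetch_lock.acquire(blocking=False):
            # An earlier, long-running fetch is still being served: say so
            # explicitly instead of looking identical to a down ADC.
            skipped.add_metric([self.nsip], 1.0)
            yield skipped
            return
        try:
            self.fetch_stats()
            probe.add_metric([self.nsip], 1.0)
        except Exception:
            probe.add_metric([self.nsip], 0.0)
        finally:
            _fetch_lock.release()
        skipped.add_metric([self.nsip], 0.0)
        yield probe
        yield skipped

A scrape that arrives while a long fetch is still running then shows citrixadc_scrape_skipped 1 instead of being indistinguishable from an unreachable ADC.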

rafaelpirolla commented 3 years ago

Sure, that's usually also done: https://prometheus.io/docs/instrumenting/writing_exporters/#metrics-about-the-scrape-itself

Note that you should report the adc_exporter_probe metric (naming is hard, right?) as 0 only when the scrape fails.

Scrapes can actually occur in parallel; that's why they suggest creating new metrics on every scrape: https://prometheus.io/docs/instrumenting/writing_exporters/#collectors That way, if the 1st request takes 45 s it's fine, since after those first 45 s you'll get fresh metrics every 10 s, as per your example. I wouldn't try scrape intervals that low on a production ADC though, because it really hurts it once the management CPU goes above 80%...
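
The collector pattern from that guide looks like this with prometheus_client - a generic, self-contained illustration rather than exporter code. In recent prometheus_client versions the built-in HTTP server handles each scrape in its own thread, so one slow collect() does not block the next scrape, and every scrape builds fresh metric families instead of mutating shared gauges.

import time

from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY, GaugeMetricFamily


class PerScrapeCollector:
    """Builds fresh metric families on every scrape, as the guide recommends."""

    def collect(self):
        start = time.time()
        time.sleep(2)  # stand-in for a slow NITRO fetch
        g = GaugeMetricFamily('demo_fetch_duration_seconds',
                              'How long this particular fetch took')
        g.add_metric([], time.time() - start)
        yield g


if __name__ == '__main__':
    REGISTRY.register(PerScrapeCollector())
    start_http_server(9101)  # illustrative port
    while True:
        time.sleep(60)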

rafaelpirolla commented 3 years ago

This is the only metric missing before we can retire the legacy monitoring solution. :(

aroraharsh23 commented 3 years ago

Parallel scraping would require a complete overhaul, which is quite a stretch for this requirement. If we go with just the up metric, it has to be hard-coded to 0 or 1 depending on whether the ADC is accessible or not (a NITRO error is the only accessibility signal we get), which, again, you have mentioned you are not OK with. That's why we haven't taken this up so far. We'll discuss a workaround again.

rafaelpirolla commented 3 years ago

I'm manually keeping the scrape interval in line with the average response time and I'm fine with it - as presumably is everyone using this exporter. I would be happy with a hardcoded metric that indicates it was not possible to scrape the ADC within a configurable timeout interval.

aroraharsh23 commented 3 years ago

If we go ahead with the metric below, does it meet your requirement?

When the ADC is accessible:

# HELP citrixadc_adc_up adc_up
# TYPE citrixadc_adc_up gauge
citrixadc_adc_up{citrixadc_access_status="UP",nsip="10.106.172.21"} 1.0

When the ADC is not accessible:

# HELP citrixadc_adc_up adc_up
# TYPE citrixadc_adc_up gauge
citrixadc_adc_up{citrixadc_access_status="DOWN",nsip="10.106.172.11"} 0.0

The above is not for scrape timeouts or anything like that, just a plain ADC probe, to keep it simple. Timeouts can be handled by increasing the scrape interval.

One doubt: if the user gives an incorrect username/password, what should the status be? The ADC will return a NITRO error, which means it is accessible, so should that be UP, or, since the scrape failed for whatever reason, DOWN?

rafaelpirolla commented 3 years ago

I think a better name for the metric would be adc_probe_success. Set it to 1 on success and 0 on any failure - be it a timeout or wrong credentials. Personally I would drop the citrixadc_access_status label; there is no good way that I am aware of to act on label strings.

aroraharsh23 commented 3 years ago

OK, then for any kind of probe failure it will be:

# HELP citrixadc_probe_success probe_success
# TYPE citrixadc_probe_success gauge
citrixadc_probe_success{nsip="10.106.172.21"} 0.0

Otherwise:

# HELP citrixadc_probe_success probe_success
# TYPE citrixadc_probe_success gauge
citrixadc_probe_success{nsip="10.106.172.21"} 1.0

Will release a new version soon.
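
A minimal sketch of how the agreed gauge could be produced with prometheus_client; the ProbeCollector class and its probe callable are hypothetical, but the scraped output matches the format above.

from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily


class ProbeCollector:
    """Emits the agreed gauge: 1.0 on a successful probe, 0.0 on any failure
    (timeout, refused connection, bad credentials)."""

    def __init__(self, nsip, probe):
        self.nsip = nsip
        self.probe = probe  # assumed callable returning True/False

    def collect(self):
        g = GaugeMetricFamily('citrixadc_probe_success', 'probe_success',
                              labels=['nsip'])
        g.add_metric([self.nsip], 1.0 if self.probe() else 0.0)
        yield g


registry = CollectorRegistry()
registry.register(ProbeCollector('10.106.172.21', lambda: False))
print(generate_latest(registry).decode())
# Prints:
#   # HELP citrixadc_probe_success probe_success
#   # TYPE citrixadc_probe_success gauge
#   citrixadc_probe_success{nsip="10.106.172.21"} 0.0

An alert can then simply key on citrixadc_probe_success == 0.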

rafaelpirolla commented 3 years ago

Thank you!

rafaelpirolla commented 3 years ago

I think it would be good to document this in the "stats exported by default" section of the README.

aroraharsh23 commented 3 years ago

Updated.