netscaler / netscaler-adc-metrics-exporter

Export metrics from Citrix ADC (NetScaler) to Prometheus

Failed to fetch some metrics #15

Closed. mveroone closed this issue 5 years ago.

mveroone commented 5 years ago

Hi,

I've followed the README, added a user & policy to my NetScaler, and installed this exporter in my Rancher cluster (Docker orchestrator). I do get some metrics, although not all of them, and in particular I'm missing the ones I was looking for (service & servicegroup up/down status).

Here is the debug log from the exporter:

8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 WARNING  Could not collect metric: 'servicegroupmember'
8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 INFO     Collecting metric ns for w.x.y.z:443
8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 INFO     Collecting metric lbvserver for w.x.y.z:443
8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 INFO     Collecting metric protocolip for w.x.y.z:443
8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 INFO     Collecting metric nscapacity for w.x.y.z:443
8/2/2019 3:09:23 PM2019-08-02T13:09:23+0000 INFO     metrices for lbvserver with k8sprefix "VIP" are not fetched
8/2/2019 3:09:27 PM2019-08-02T13:09:27+0000 INFO     Collecting metric protocoltcp for w.x.y.z:443
8/2/2019 3:09:27 PM2019-08-02T13:09:27+0000 INFO     Collecting metric aaa for w.x.y.z:443
8/2/2019 3:09:27 PM2019-08-02T13:09:27+0000 INFO     Collecting metric service for w.x.y.z:443
8/2/2019 3:09:27 PM2019-08-02T13:09:27+0000 WARNING  Could not collect metric: u'service'
8/2/2019 3:09:27 PM2019-08-02T13:09:27+0000 INFO     Collecting metric csvserver for w.x.y.z:443
8/2/2019 3:09:28 PM2019-08-02T13:09:28+0000 INFO     Collecting metric Interface for w.x.y.z:443
8/2/2019 3:09:28 PM2019-08-02T13:09:28+0000 INFO     Collecting metric system for w.x.y.z:443
8/2/2019 3:09:28 PM2019-08-02T13:09:28+0000 INFO     Collecting metric protocolhttp for w.x.y.z:443
8/2/2019 3:09:28 PM2019-08-02T13:09:28+0000 INFO     Collecting metric ssl for w.x.y.z:443
8/2/2019 3:09:28 PM2019-08-02T13:09:28+0000 INFO     Collecting metric services for w.x.y.z:443

When switching to HTTP and doing a network trace, I can see that the NS answers 200 to most requests and that the answer seems correct, so it looks like it's the exporter that cannot process them. For example:

 curl https://user:password@w.x.y.z/nitro/v1/stat/servicegroup/ServiceGroup_Name?statbindings=yes
{ "errorcode": 0, "message": "Done", "severity": "NONE", "servicegroup": [ { "servicegroupname": "ServiceGroup_Name", "state": "ENABLED", "servicetype": "HTTP" } ] 

Could it be an incompatibility with my hardware/firmware?
Hardware: NSMPX-8000-10G
Firmware: NS12.1 Build 52.15
Exporter versions: 1.0.7 and/or latest (found no release notes / version history)

aroraharsh23 commented 5 years ago

Hi @mveroone, can you confirm the version in which you are observing this issue?

mveroone commented 5 years ago

Hi. I've tested both the 1.0.7 and latest tags of the Docker image.

aroraharsh23 commented 5 years ago

@mveroone I have pushed version "1.0.8", in which the status (UP/DOWN) of services/servicegroupmembers is exposed as labels (for string values, that's the only option). The same is also reflected in Grafana, and the other metrics for these two entities now work as well. Let me know if this resolves your issue. Also, we have removed the "latest" tag; "1.0.8" is now the most up-to-date version.
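
As a rough illustration of that pattern (metric and label names here are hypothetical, not necessarily the ones the exporter uses): Prometheus sample values must be numeric, so a string state such as UP/DOWN can only be surfaced through labels, for example with prometheus_client:

from prometheus_client import Gauge

# The numeric value only marks the sample; the interesting information is in the labels
service_state = Gauge(
    "citrix_service_state",
    "Current state of a service, encoded as a label",
    ["service_name", "state"],
)
service_state.labels(service_name="svc_web_01", state="UP").set(1)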

mveroone commented 5 years ago

Thanks for the answer, I'll try it.

Where can I find the release notes for these versions?
If there aren't any, I'd suggest keeping the GitHub "Releases" page up to date.

aroraharsh23 commented 5 years ago

Yes, will add that section soon.

mveroone commented 5 years ago

Actually, it doesn't work with 1.0.8 either:

INFO     Collecting metric protocoltcp for w.x.y.z:443
INFO     Collecting metric aaa for w.x.y.z:443
INFO     Collecting metric service for w.x.y.z:443
WARNING  Could not collect metric: u'service'
INFO     Collecting metric csvserver for w.x.y.z:443
INFO     Collecting metric Interface for w.x.y.z:443
INFO     Collecting metric system for w.x.y.z:443
INFO     Collecting metric protocolhttp for w.x.y.z:443
INFO     Collecting metric ssl for w.x.y.z:443
INFO     Collecting metric services for w.x.y.z:443
WARNING  Could not collect metric: 'servicegroupmember'
INFO     Collecting metric ns for w.x.y.z:443
INFO     Collecting metric lbvserver for w.x.y.z:443
INFO     Collecting metric protocolip for w.x.y.z:443
INFO     Collecting metric nscapacity for w.x.y.z:443
INFO     metrices for lbvserver with k8sprefix "VIP" are not fetched
WARNING  Counter stats for totalsvrbusyerr not enabled in netscalar w.x.y.z:443, so could not add to lbvserver
WARNING  Gauge stats for svrbusyerrrate not enabled in netscalar w.x.y.z:443, so could not add to lbvserver

Image ID used (for 1.0.8) is sha256:10e5ea7599889214cab6e31c62240c66b28639957c97edd1cff073f03487acab

(The servicegroupmember error seems to be fixed by your latest commit, bb942aedfa57140133c0d9d6af0148deb456b120.)

aroraharsh23 commented 5 years ago

@mveroone Looks like a version mismatch; my digest is not the same as the one you pulled. Just for clarity, I have created a new version, 1.0.9, with image ID sha256:ca5def19416ceaa9aeb85e9f92f3111b71f67050a84e7861bfee5933f4ec1dbe. You can check whether this one is fine for you as well.

mveroone commented 5 years ago

Careful, the image digest and the image ID are different hashes. The digest can change when an image is retagged and re-pushed, while the ID changes only with the image content.

Here is how I check the image ID:

root@docker:~# docker images --no-trunc | grep citrix
REPOSITORY                                           TAG                    IMAGE ID                                                                  CREATED             SIZE
<our repository>/citrix/netscaler-metrics-exporter   1.0.9                  sha256:32b710af16e67df4ee0ba7d4fcc712ccb5dbea53bcfde02dc726dd71e0898d74   About an hour ago   63.2MB
<our repository>/citrix/netscaler-metrics-exporter   1.0.8                  sha256:10e5ea7599889214cab6e31c62240c66b28639957c97edd1cff073f03487acab   3 hours ago         63.2MB

Note: I've tried both proxying your registry through ours and simply copying the image (using docker pull/tag/push), with the same results.

aroraharsh23 commented 5 years ago

Thanks for confirming. So, as per your testing, do you see all your issues resolved with the latest commits if you just run the python exporter.py script directly, without using the image? That way we can narrow down whether this is an image issue or whether some other fix is needed for your environment. The reason I'm asking you to test with just the Python script is that I was seeing the same issues earlier, but after the latest commits they are no longer observed here, so we can focus on getting the image sorted.

aroraharsh23 commented 5 years ago

@mveroone

mveroone commented 5 years ago

Hi @aroraharsh23,

I've confirmed the behaviour is identical when running outside of Docker.

For the record, your Dockerfile could be tuned quite a bit to reduce the number of layers and clean up unused cache files, but that's unrelated; I'll open another issue later.

PS: no need to tag me in comments, I've subscribed to the issue and receive updates via email.

aroraharsh23 commented 5 years ago

Can we have a web session to resolve this? You can suggest a suitable time for tomorrow at your convenience. My timezone: IST.

mveroone commented 5 years ago

Sure, but be advised I only have read-only access to our NetScaler MPX CLI and no GUI access. Please email me the details => xxx@yyy.com (I will remove the address from this comment once you have it, to avoid spam later).

My timezone is CEST, which is 3:30 hours behind yours, so I'd suggest tomorrow (August 7th) at 10:30 CEST == 14:00 IST.

Thanks for your time & patience

aroraharsh23 commented 5 years ago

I have emailed a GTM link for 14.00 IST Aug 7.

mveroone commented 5 years ago

Hi, after a bit of trial and error, you were right: I had a servicegroup without any backend attached (whether a service or a vserver). While I'm going to look into why that configuration exists, which makes no sense, could you please consider the following patch?

diff --git a/exporter.py b/exporter.py
index 30c96e8..39487b5 100755
--- a/exporter.py
+++ b/exporter.py
@@ -69,7 +69,7 @@ def collect_data(nsip, entity, username, password, protocol, nitro_timeout):
                 # get dict with stats of all services bound to a particular servicegroup
                 r = requests.get(url, headers=headers, verify=False, timeout=nitro_timeout)
                 data_tmp = r.json()
-                if data_tmp['errorcode'] == 0:
+                if data_tmp['errorcode'] == 0 and  'servicegroupmember' in data_tmp['servicegroup'][0]:
                     # create a list with stats of all services bound to NS of all servicegroups
                     for individual_servicebinding_data in data_tmp['servicegroup'][0]['servicegroupmember']:
                         # manually adding key:value '_manual_servicegroup_name':_manual_servicegroup_name to stats of a particular 

aroraharsh23 commented 5 years ago

I reviewed your patch, but I don't think it should be merged: with this change, the user will not get any notification that there are no members in the servicegroup.

I am saying this because the 'servicegroup' and 'services' stats are primarily meant for the members, and with this change no warning would be thrown when the entity is effectively not present. It would show that the 'services' stats were properly fetched while they weren't, since all the counters are for the members only. So the current code is fine.
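
For reference, a middle ground that keeps your guard but still surfaces the empty servicegroup would look roughly like this (a sketch only, assuming the module's existing logger; url is already in scope from the request above):

if data_tmp['errorcode'] == 0:
    if 'servicegroupmember' in data_tmp['servicegroup'][0]:
        # existing per-member handling
        for individual_servicebinding_data in data_tmp['servicegroup'][0]['servicegroupmember']:
            ...
    else:
        # warn instead of silently skipping the servicegroup
        logger.warning('No members bound to servicegroup at %s', url)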

Let me know if you have a different opinion, or we can close this issue.

mveroone commented 5 years ago

I understand your position, that's fine.
Thanks for your support!