mrlhansen / idrac_exporter

Simple Redfish (iDRAC, iLO, XClarity) exporter for Prometheus
MIT License

Better errors for troubleshooting #69

Closed 7840vz closed 5 months ago

7840vz commented 6 months ago

Hi! We have 200+ Lenovo servers: same model, same firmware, same site, all monitored by a single idrac_exporter instance. 190 out of 200 report everything correctly. However, around 10 servers report only "host unreachable after 3 retries":

2024-05-06T17:22:03+03:00   2024-05-06T14:22:03.650 ERROR Error instantiating metrics collector for host xx.xx.10.163: host unreachable after 3 retries
2024-05-06T17:22:03+03:00   2024-05-06T14:22:03.650 DEBUG Handling request from idrac-exporter:9348 for host xx.xx.10.163

The thing is, we can successfully poll this server's Redfish API with wget, using the same login/password, FROM the same host where idrac_exporter runs, with no problem. So it is not a firewall issue or a wrong password:

wget -U - --no-check-certificate https://<user>:<pass>@<ip>:443/redfish/v1/Systems/1/Memory

{"Name":"Memory Collection","@odata.type":"#MemoryCollection.MemoryCollection","Members":[{"@odata.id":"/redfish/v1/Systems/1/Memory/1"},{"@odata.id":"/redfish/v1/Systems/1/Memory/2"},{"@odata.id":"/redfish/v1/Systems/1/Memory/3"},{"@odata.id":"/redfish/v1/Systems/1/Memory/4"},{"@odata.id":"/redfish/v1/Systems/1/Memory/5"},{"@odata.id":"/redfish/v1/Systems/1/Memory/6"},{"@odata.id":"/redfish/v1/Systems/1/Memory/7"},{"@odata.id":"/redfish/v1/Systems/1/Memory/8"},{"@odata.id":"/redfish/v1/Systems/1/Memory/9"},{"@odata.id":"/redfish/v1/Systems/1/Memory/10"},{"@odata.id":"/redfish/v1/Systems/1/Memory/11"},{"@odata.id":"/redfish/v1/Systems/1/Memory/12"},{"@odata.id":"/redfish/v1/Systems/1/Memory/13"},{"@odata.id":"/redfish/v1/Systems/1/Memory/14"},{"@odata.id":"/redfish/v1/Systems/1/Memory/15"},{"@odata.id":"/redfish/v1/Systems/1/Memory/16"},{"@odata.id":"/redfish/v1/Systems/1/Memory/17"},{"@odata.id":"/redfish/v1/Systems/1/Memory/18"},{"@odata.id":"/redfish/v1/Systems/1/Memory/19"},{"@odata.id":"/redfish/v1/Systems/1/Memory/20"},{"@odata.id":"/redfish/v1/Systems/1/Memory/21"},{"@odata.id":"/redfish/v1/Systems/1/Memory/22"},{"@odata.id":"/redfish/v1/Systems/1/Memory/23"},{"@odata.id":"/redfish/v1/Systems/1/Memory/24"},{"@odata.id":"/redfish/v1/Systems/1/Memory/25"},{"@odata.id":"/redfish/v1/Systems/1/Memory/26"},{"@odata.id":"/redfish/v1/Systems/1/Memory/27"},{"@odata.id":"/redfish/v1/Systems/1/Memory/28"},{"@odata.id":"/redfish/v1/Systems/1/Memory/29"},{"@odata.id":"/redfish/v1/Systems/1/Memory/30"},{"@odata.id":"/redfish/v1/Systems/1/Memory/31"},{"@odata.id":"/redfish/v1/Systems/1/Memory/32"}],"Description":"A collection of memory resource instances.","@odata.etag":"\"da843bae3cfd30e805a46\"","Members@odata.count":32,"@odata.id":"/redfish/v1/Systems/1/Memory","Oem":{"Lenovo":{"HistoryMemMetric":{"@odata.id":"/redfish/v1/Systems/1/Memory/Oem/Lenovo/HistoryMemMetric"},"HistoryMemRecovery":{"@odata.id":"/redfish/v1/Systems/1/Memory/Oem/Lenovo/HistoryMemRecovery"}}},"@odata.context":"/redfish/v1/$metadata#MemoryCollection.MemoryCollection"}

We also set the timeout to around 55s, and scraping normally takes around 5-15 seconds.

Right now we are clueless about what is wrong with our setup and these 10 boxes. Is there a way to add more detail about what went wrong when the connection to a server has failed? Maybe the HTTP response code, the credentials used (default/non-default), etc.?

Thanks

mrlhansen commented 6 months ago

Hi @7840vz

When using the -verbose flag you should get more information about what is going wrong. In the redfishGet() function there are debug statements if the HTTP request fails, if the return code is not OK, or if the data parsing fails. Do you see any of these in the log?
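For reference, the pattern is roughly the following. This is only a simplified sketch to show the three failure points, not the exporter's actual code; the function name and logger here are illustrative:

package collector

import (
    "encoding/json"
    "log"
    "net/http"
)

// redfishGet sketches the debug statements described above: one message when
// the HTTP request itself fails, one when the status code is not 200 OK, and
// one when the JSON body cannot be parsed.
func redfishGet(client *http.Client, url string, out any) bool {
    resp, err := client.Get(url)
    if err != nil {
        log.Printf("DEBUG: HTTP request to %s failed: %v", url, err)
        return false
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        log.Printf("DEBUG: %s returned status code %d", url, resp.StatusCode)
        return false
    }

    if err := json.NewDecoder(resp.Body).Decode(out); err != nil {
        log.Printf("DEBUG: failed to parse response from %s: %v", url, err)
        return false
    }
    return true
}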

Edit: Just in case it's not obvious. Once it returns "host unreachable after 3 retries" the exporter will not try the target again, so if the hosts were temporarily unreachable when the exporter was started, it either has to be restarted, or the targets have to be reset using the /reset endpoint.

7840vz commented 6 months ago

Hi @mrlhansen, thanks for the reply. As for the -verbose flag, we are already using it, so when everything is fine we can see lots of calls to different API handles.

Unfortunately, it does not shed light on what is wrong when metrics are not collected.

For example, we found one host where the password was actually wrong:

2024-05-07T09:54:15+03:00   2024-05-07T06:54:15.353 ERROR Error instantiating metrics collector for host xx.xx.10.105: host unreachable after 3 retries
2024-05-07T09:54:15+03:00   2024-05-07T06:54:15.353 DEBUG Handling request from idrac-exporter:9348 for host xx.xx.10.105

There is also one host where a nonexistent IP is being used in the inventory DB:

2024-05-07T09:57:45+03:00   2024-05-07T06:57:45.716 ERROR Error instantiating metrics collector for host xx.xx.1.230: host unreachable after 3 retries
2024-05-07T09:57:45+03:00   2024-05-07T06:57:45.716 DEBUG Handling request from idrac-exporter:9348 for host xx.xx.1.230

The errors are exactly the same. Indicating in the first case that the problem is a 401 Unauthorized, and hinting in the second case that the problem is actually a network timeout, would help a lot, IMHO.
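Something along these lines, purely as an illustration (a hypothetical helper, not code from the exporter), is what I have in mind:

package collector

import (
    "errors"
    "fmt"
    "net"
    "net/http"
)

// classifyError is a hypothetical helper showing the level of detail meant
// above: it separates a network timeout from an authentication failure so
// the log message can say which of the two it was.
func classifyError(resp *http.Response, err error) string {
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return "network timeout"
    }
    if err != nil {
        return fmt.Sprintf("connection error: %v", err)
    }
    if resp.StatusCode == http.StatusUnauthorized {
        return "401 Unauthorized (check credentials)"
    }
    return fmt.Sprintf("unexpected status code %d", resp.StatusCode)
}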

7840vz commented 6 months ago

> Edit: Just in case it's not obvious. Once it returns "host unreachable after 3 retries" the exporter will not try the target again, so if the hosts were temporarily unreachable when the exporter was started, it either has to be restarted, or the targets have to be reset using the /reset endpoint.

That is a surprise. For us it is a big problem, as we cannot guarantee that all servers are up when idrac_exporter is restarted. Also, servers are synced to Prometheus from the inventory, so idrac_exporter could theoretically start polling servers before they are properly configured.

Why was this logic added, and what are its benefits? If it is to reduce the number of calls to unreachable servers, then perhaps implementing exponential backoff (https://en.wikipedia.org/wiki/Exponential_backoff) would be a better solution, so that monitoring of all servers can eventually be restored without restarting idrac_exporter?
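Roughly this idea, as a generic sketch (not tied to the exporter's internals): a failed target is not dropped for good, it is just not retried before a deadline that doubles after each failure.

package collector

import "time"

// backoffState sketches exponential backoff for a scrape-driven exporter:
// after each failure the wait doubles, capped at maxDelay, and the target
// is only probed again once the deadline has passed.
type backoffState struct {
    delay    time.Duration
    deadline time.Time
}

func (b *backoffState) allow(now time.Time) bool {
    return now.After(b.deadline)
}

func (b *backoffState) failed(now time.Time, maxDelay time.Duration) {
    if b.delay == 0 {
        b.delay = time.Second
    } else {
        b.delay *= 2
    }
    if b.delay > maxDelay {
        b.delay = maxDelay
    }
    b.deadline = now.Add(b.delay)
}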

P.S. This was the main cause. By restarting idrac_exporter we restored monitoring for the rest of the servers!

mrlhansen commented 6 months ago

It was implemented to not spend time on targets that could not be reached, but in hindsight I think my choice of defaults was poor. In the next version I am going to implement a new default behavior for the retries parameter, such that if retries is set to 0 it will never mark a target as invalid.

7840vz commented 6 months ago

Thanks! Is there a way to disable it altogether, perhaps with a command-line or config flag, to be sure that idrac_exporter runs truly stateless and side-effect free? Our use case is that we run monitoring in sync with the inventory for targets, so nonexistent servers are removed from the inventory and from monitoring automatically. All servers synced from the inventory with a specific tag=monitoring should be tried forever. When that is no longer the case, those servers are removed from Prometheus, so calls to idrac_exporter stop as well.

This is a guarantee for us that no manual intervention would be required to restore monitoring.

mrlhansen commented 6 months ago

Until I release the next version you can in principle just set retries to a very high number, such that it will never reach the limit. In the next release setting retries=0 will disable this behavior.

The exporter is not completely stateless, because the first time it connects to a host it will find the different endpoints needed to collect the metrics. So if you e.g. swapped the machine on a given IP address with another machine, it could start failing if the endpoints are wrong (e.g. if the machine is from another vendor).

Finding the different endpoints can take 5-10 seconds, so the exporter stores them, since they never change for a given host and it saves time when scraping the metrics. If you have an environment where everything is ephemeral, I can see how this could be a problem, but I would not expect it to be very common.

7840vz commented 6 months ago

Got it. So, basically, all new collectors go into a collectors map to 'cache' the handles used, and failed entries go there as well. So perhaps, if this map were transformed into a cache with a TTL, those entries could expire, giving an extra opportunity to restore metrics collection in the case of an initially faulty or swapped server?

For example, with something like https://github.com/patrickmn/go-cache.
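As a minimal sketch of what I mean (Collector and the key are placeholders here, not the exporter's actual types):

package collector

import (
    "time"

    "github.com/patrickmn/go-cache"
)

// Collector is a placeholder for whatever state the exporter keeps per target.
type Collector struct{}

// Entries expire after 30 minutes, so a target that failed its first probe,
// or was swapped for a different machine, is re-initialized on a later
// scrape instead of staying invalid until the exporter restarts.
var collectors = cache.New(30*time.Minute, 10*time.Minute)

func getCollector(target string) *Collector {
    if c, found := collectors.Get(target); found {
        return c.(*Collector)
    }
    c := &Collector{} // probe the Redfish endpoints here
    collectors.Set(target, c, cache.DefaultExpiration)
    return c
}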

Of course, these edge cases may be rare, but they are possible, as demonstrated above. And they look quite hard to troubleshoot, since one would expect that solving connectivity problems on the server side should be enough and that metric collection would heal itself eventually.

mrlhansen commented 6 months ago

I don't think a cache with a TTL is the right way to go, because that is also not entirely reliable.

Either one drops the state completely and reinitializes the target every time, or one makes a call to /reset to clear the internal state for the given target. If you have a truly ephemeral environment where the same IP address can be reused for another machine at random, then you presumably also have some sort of enrollment procedure for the machine. As part of that enrollment procedure you could simply reset the target in the exporter.
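As a sketch of what that enrollment step could do (assuming the exporter is reachable as idrac-exporter:9348, as in the logs above, and that /reset takes the host as a target query parameter):

package enrollment

import (
    "fmt"
    "net/http"
)

// resetTarget asks the exporter to drop its cached state for one host as
// part of the enrollment procedure. The exporter address and the "target"
// query parameter are assumptions for this sketch.
func resetTarget(host string) error {
    url := fmt.Sprintf("http://idrac-exporter:9348/reset?target=%s", host)
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("reset of %s returned status %d", host, resp.StatusCode)
    }
    return nil
}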

I agree that what I implemented was not ideal, but I don't want to over-engineer this.

7840vz commented 6 months ago

I see. I'm most worried about the case where machines are down when idrac_exporter is first started, so that it ignores some servers for good.

The only workaround I see is to restart idrac_exporter daily with a cron job. Also, setting retries to 0 in the upcoming version might do the trick.

Thanks for explaining everything here.

mrlhansen commented 6 months ago

That issue should be permanently solved if you set retries to a high value in the current version (or to zero in the next version).

Anytime, I am always happy to see people using it :)