Open johnstile opened 3 weeks ago
@johnstile given that in some scenarios dns failures/network failures are temporary, why kick it out of the list instead of simply retrying later?
I agree that crashing is not an option.
Our failures tend to be permanent (at least over the span of the process lifetime), but I agree that it might be better to keep it in the list and keep trying.
We started working on a simple patch to skip the config line in the case of that particular error, but it's kind of unsatisfying as that error code might be returned for some related but not identical reason.
The evaluation of the config file is implemented in a way where the config file is not re-read after the initialization (unlike projects like samba).
This has the benefit that the vulnerability only exists when ldmsd starts, and a dns issue after initialization will leave ldmsd running, and it will try again, as recommend.
I would be satisfied if the ldmsd would simply skip the hostname check.
I have a dumb patch so that we can survive right now, but it is a hack.
From ldms/src/ldmsd/ldmsd_config.c, in __process_config_file, when xprt.rsp_err == 97, it calls goto next_req;
I'm not sure if this is the best error code to return from this function.
Error code 97 - Address family not supported
Created a pull request
@jstile-lbl I don't think we want to skip the request. I think we want it to add the producer so that the normal producer retry logic will attempt to reconnect. The issue here is that the resolution of the host name is happening at config time, not at connect time. To really fix this, I think we need to move the resolution logic to the connect path. @nichamon, can you take a quick look at that?
@nichamon, can you take a quick look at that?
Yes, I'm wrapping up my current item, and this will be next on my list.
Since we aren't going to exit when the hostname can't be resolved, would the solution be to not run the hostname check?
@jstile-lbl We still want to validate hostnames, but we're going to move this check from config parsing time to the connection phase before connecting to the client. This means ldmsd won't fail during startup due to unresolvable hostnames. In the connection phase, ldmsd will print an error message if the hostname cannot be resolved and try to resolve the hostname again at the next reconnect interval.
@jstile-lbl We have to convert (resolve) a "dns-name" to an ip-address. It is something that we have to do at some point or we can't connect.
So what we are discussing doing is moving that resolution (hostname --> Ip-address) from the configuration path to the connect path. This would delay the notice of silly hostname errors, but improve resilience to DNS failures. We might do both, i.e. log a warning at configuration time, and then log errors at connect time.
@jstile-lbl, @nichamon just to be clear...it's not a "validation", it's a conversion/resolution. It has to be done or we can't connect.
@tom95858 @nichamon can we leave a 'check and warn' of hostnames at parse time so that issues are discovered sooner at startup? I agree that rewarning at connect time, but please let's not have it warn on every retry.
ldmsd aggregator dies when a hostname of a sampler node can't be resolved
I would like for ldmsd to continue working and simply kick the bad host out of the list.
Version: ovis-4.4.2