ovis-hpc / ldms

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.
https://github.com/ovis-hpc/ovis-wiki/wiki
Other
100 stars 52 forks source link

ldmsd aggregator dies when a hostname of a sampler node can't be resolved #1489

Open johnstile opened 3 weeks ago

johnstile commented 3 weeks ago

ldmsd aggregator dies when a hostname of a sampler node can't be resolved

ldmsd_prdcr_new: my_bad_hostname:6002 not resolved.
msg_no 4654: error 97: Error resolving hostname 'my_bad_hostname:'

I would like for ldmsd to continue working and simply kick the bad host out of the list.

Version: ovis-4.4.2

baallan commented 3 weeks ago

@johnstile given that in some scenarios dns failures/network failures are temporary, why kick it out of the list instead of simply retrying later?

I agree that crashing is not an option.

evanmcc commented 3 weeks ago

Our failures tend to be permanent (at least over the span of the process lifetime), but I agree that it might be better to keep it in the list and keep trying.

We started working on a simple patch to skip the config line in the case of that particular error, but it's kind of unsatisfying as that error code might be returned for some related but not identical reason.

johnstile commented 3 weeks ago

The evaluation of the config file is implemented in a way where the config file is not re-read after the initialization (unlike projects like samba).

This has the benefit that the vulnerability only exists when ldmsd starts, and a dns issue after initialization will leave ldmsd running, and it will try again, as recommend.

I would be satisfied if the ldmsd would simply skip the hostname check.

I have a dumb patch so that we can survive right now, but it is a hack.

From ldms/src/ldmsd/ldmsd_config.c, in __process_config_file, when xprt.rsp_err == 97, it calls goto next_req;

I'm not sure if this is the best error code to return from this function.

Error code 97 - Address family not supported

jstile-lbl commented 3 weeks ago

Created a pull request

https://github.com/ovis-hpc/ldms/pull/1491

tom95858 commented 3 weeks ago

@jstile-lbl I don't think we want to skip the request. I think we want it to add the producer so that the normal producer retry logic will attempt to reconnect. The issue here is that the resolution of the host name is happening at config time, not at connect time. To really fix this, I think we need to move the resolution logic to the connect path. @nichamon, can you take a quick look at that?

nichamon commented 3 weeks ago

@nichamon, can you take a quick look at that?

Yes, I'm wrapping up my current item, and this will be next on my list.

jstile-lbl commented 3 weeks ago

Since we aren't going to exit when the hostname can't be resolved, would the solution be to not run the hostname check?

nichamon commented 3 weeks ago

@jstile-lbl We still want to validate hostnames, but we're going to move this check from config parsing time to the connection phase before connecting to the client. This means ldmsd won't fail during startup due to unresolvable hostnames. In the connection phase, ldmsd will print an error message if the hostname cannot be resolved and try to resolve the hostname again at the next reconnect interval.

tom95858 commented 3 weeks ago

@jstile-lbl We have to convert (resolve) a "dns-name" to an ip-address. It is something that we have to do at some point or we can't connect.

So what we are discussing doing is moving that resolution (hostname --> Ip-address) from the configuration path to the connect path. This would delay the notice of silly hostname errors, but improve resilience to DNS failures. We might do both, i.e. log a warning at configuration time, and then log errors at connect time.

tom95858 commented 3 weeks ago

@jstile-lbl, @nichamon just to be clear...it's not a "validation", it's a conversion/resolution. It has to be done or we can't connect.

baallan commented 3 weeks ago

@tom95858 @nichamon can we leave a 'check and warn' of hostnames at parse time so that issues are discovered sooner at startup? I agree that rewarning at connect time, but please let's not have it warn on every retry.