perfsonar / toolkit

perfSONAR Toolkit distribution environment scripts and GUI
Apache License 2.0

Conflict in psconfig resolve with DNS alias #475

Closed rhclopes closed 4 weeks ago

rhclopes commented 1 month ago

The command 'psconfig remote add <url>' fails to add a test if the reverse DNS lookup for an IP address resolves to an alias. For example,

raullopes@JMQWPYG263DJ-CE ~ % host ps-slough-lat.perf.ja.net
ps-slough-lat.perf.ja.net has address 194.81.18.229
ps-slough-lat.perf.ja.net has IPv6 address 2001:630:3c:f803::a
raullopes@JMQWPYG263DJ-CE ~ % host 194.81.18.229
229.18.81.194.in-addr.arpa is an alias for 229.224/27.18.81.194.in-addr.arpa.
229.224/27.18.81.194.in-addr.arpa domain name pointer ps-slough-lat.perf.ja.net.

When we run

psconfig remote add "https://..."

no latencybg tests are generated for the host ps-slough-lat.perf.ja.net.

The problem is fixed by adding the following lines to /etc/hosts

194.81.18.229 ps-slough-lat.perf.ja.net
2001:630:3c:f803::a ps-slough-lat.perf.ja.net

and giving /etc/hosts priority in /etc/nsswitch.conf (that is, listing files before dns on the hosts line).
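To check whether the workaround has taken effect, a forward-then-reverse lookup through the system resolver can be compared against the expected name. The following is a minimal sketch and not how pSConfig itself detects addresses; the function names (forward_addresses, reverse_name, check_round_trip) are hypothetical. It assumes Python's socket calls go through the same libc resolver path as other programs on the host, so they should reflect /etc/hosts and /etc/nsswitch.conf.

```python
import socket


def forward_addresses(hostname):
    """Return the set of IP addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def reverse_name(address):
    """Return the primary name the system resolver gives for address, or None."""
    try:
        return socket.gethostbyaddr(address)[0]
    except (socket.herror, socket.gaierror):
        return None


def check_round_trip(hostname, forward=forward_addresses, reverse=reverse_name):
    """Map each forward address to (reverse name, whether it matches hostname).

    The forward and reverse arguments are injectable so the comparison
    logic can be exercised without touching the network.
    """
    expected = hostname.rstrip(".").lower()
    report = {}
    for address in sorted(forward(hostname)):
        name = reverse(address)
        matches = name is not None and name.rstrip(".").lower() == expected
        report[address] = (name, matches)
    return report
```

Calling check_round_trip("ps-slough-lat.perf.ja.net") on the affected host returns one entry per address; any entry whose second element is False means the reverse lookup did not come back to the name that was queried.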

BTW, I think that traceroute and tracepath will segfault in the presence of that alias, even if the alias is well defined.

Raul

timchown commented 1 month ago

That is so weird, but we have seen it before. Would be really nice to understand what causes it.

arlake228 commented 1 month ago

Just adding some notes from our debug session on this last week:

arlake228 commented 1 month ago

@rhclopes do you happen to have /var/log/perfsonar/psconfig-pscheduler-agent.log from a time when the issue occurred? I'm specifically looking for a line like the following (except with your addresses):

2024-05-20 19:29:45 INFO pid=4153354 prog=_run_start line=104 guid=f0274777-95bf-4e68-a077-8da5cf5a8a20 pscheduler_assist_url=https://localhost/pscheduler match_addresses=["ps-dev-staging-el9-tk2.c.esnet-perfsonar.internal", "10.128.15.219", "fe80::fedc:61d7:480c:7ed4"] msg=Auto-detected match addresses

IIRC, when you were screen sharing, ps-slough-lat.perf.ja.net was in the match_addresses list, but I wanted to be sure. Also, prior to modifying /etc/hosts, was there an entry for 194.81.18.229 and/or 2001:630:3c:f803::a already?
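For anyone checking their own agent log, the match_addresses value can be pulled out of such a line with a few lines of Python. This is a rough sketch that assumes the list is written as a JSON array, as it appears in the sample line above; the function names are hypothetical and not part of pSConfig.

```python
import json
import re

# Matches the bracketed list that follows "match_addresses=" on a log line.
MATCH_ADDRESSES_RE = re.compile(r"match_addresses=(\[.*?\])")


def extract_match_addresses(line):
    """Return the match_addresses list from one log line, or None if absent."""
    found = MATCH_ADDRESSES_RE.search(line)
    if found is None:
        return None
    return json.loads(found.group(1))


def scan_log(path):
    """Yield the match_addresses list from every line of the log that has one."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            addresses = extract_match_addresses(line)
            if addresses is not None:
                yield addresses
```

Running scan_log over /var/log/perfsonar/psconfig-pscheduler-agent.log and checking whether the expected hostname appears in each list shows which name the agent auto-detected on each run.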

rhclopes commented 1 month ago

Andy,

The requested logs are attached.

psconfig-pscheduler-agent.log

I spent quite a few days looking into this problem. At some point I had the entries in /etc/hosts, removed them, and changed them to other values, and nothing helped because there were other problems. Eventually, I narrowed it down to the one problem left: the DNS issue. I added the entries and changed nsswitch.conf because I had seen a similar problem with perfSONAR 5.0, traceroute (and my distributed storage), and I just followed the same solution.

arlake228 commented 1 month ago

Thanks! I think we are closer to the cause of the issue. It was detecting the name ps-slough-lat.ja.net instead of ps-slough-lat.perf.ja.net (the latter has .perf). Do you have any ideas where ps-slough-lat.ja.net might have been coming from? /etc/hosts maybe? At least currently I don't see that name in DNS.

timchown commented 1 month ago

ps-slough-lat.ja.net is a previous name used for the interface, as was ps-slough-1g.ja.net. We were asked to move our network performance systems all under .perf.ja.net (which seemed quite reasonable), so that's the only name you should see now. I recall Raul did a complete re-install, so that name shouldn't be held or cached anywhere in the system or its configuration. The problem seems to be around the aliasing in the reverse delegation?

arlake228 commented 4 weeks ago

This should be corrected now. We can re-open if it surfaces again.