saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

grains.fqdn value is difficult to rely on #36619

Closed lottspot closed 5 years ago

lottspot commented 7 years ago

It seems that there are edge cases where grains.fqdn does not produce the same value as, say, hostname -f. This makes the value of the fqdn grain difficult to predict or rely on.

After investigating the issue further, this appears to be due to the internal use of socket.getfqdn to determine the fqdn grain. It seems that this issue has been raised before and was fixed, but the fix was later undone.

Though the current implementation of get_fqhostname does try to use socket.getaddrinfo, there appears to be no code path through which the canonname value returned by getaddrinfo can then be returned to the caller of get_fqhostname.

I'm not sure whether the intent is to only use socket.getfqdn or to fall back on it only when socket.getaddrinfo fails, but it seems like something is off in either case. From the standpoint of an administrator, it would be more intuitive if get_fqhostname gave preference to the fqdn as returned in the canonname value from socket.getaddrinfo.

slinn0 commented 7 years ago

The settings in /etc/hosts, /etc/host.conf, /etc/gai.conf, and /etc/hostname will affect how that grain comes out. I have seen this problem before with our setup. Inconsistent entries in /etc/hosts caused our issue. Having a file.managed for /etc/hosts might be one way to remove the edge cases.

lottspot commented 7 years ago

The settings in /etc/hosts, /etc/host.conf, /etc/gai.conf, and /etc/hostname will affect how that grain comes out.

The settings in gai.conf won't actually have an effect since get_fqhostname will never return the results of its call to socket.getaddrinfo (unless I'm misreading the function, which is certainly possible).

Managing the /etc/hosts file in order to get the grains.fqdn grain to agree with the output of hostname --fqdn is extremely unintuitive at best, and, depending on the system, may be infeasible at worst.

The problematic behavior of Python's socket.getfqdn has been discussed and agreed upon, though it seems that a patch never made it into Python 2.7 (not sure what the behavior in 3.4 is). The consensus seems to be that The Right Thing to do is to use the canonname returned by a call to getaddrinfo(3) in order to determine the fully qualified name of a machine.

slinn0 commented 7 years ago

I ran strace -f -e open python2.7, then ran import socket and typed socket.getfqdn(), which shows:

>>> socket.getfqdn()
open("/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 3
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
open("/etc/gai.conf", O_RDONLY|O_CLOEXEC) = 3
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3

In the grains core module fqdn is set with the following python code:

    if __FQDN__ is None:
        __FQDN__ = salt.utils.network.get_fqhostname()
    grains['fqdn'] = __FQDN__

In salt/utils/network.py, salt.utils.network.get_fqhostname is as follows:

def get_fqhostname():
    '''
    Returns the fully qualified hostname
    '''
    l = []
    l.append(socket.getfqdn())

    # try socket.getaddrinfo
    try:
        addrinfo = socket.getaddrinfo(
            socket.gethostname(), 0, socket.AF_UNSPEC, socket.SOCK_STREAM,
            socket.SOL_TCP, socket.AI_CANONNAME
        )
        for info in addrinfo:
            # info struct [family, socktype, proto, canonname, sockaddr]
            if len(info) >= 4:
                l.append(info[3])
    except socket.gaierror:
        pass

    l = _sort_hostnames(l)
    if len(l) > 0:
        return l[0]

    return None

So it looks like it calls socket.getfqdn() and then tries socket.getaddrinfo(). I wrote a simple test program:

#!/usr/bin/python2.7
import socket
if __name__ == '__main__':
    print("socket.getfqdn(): {}".format(socket.getfqdn()))
    try:
        addrinfo = socket.getaddrinfo(
            socket.gethostname(), 0, socket.AF_UNSPEC, socket.SOCK_STREAM,
            socket.SOL_TCP, socket.AI_CANONNAME
        )
        for info in addrinfo:
            if len(info) >= 4:
                print("found info:{}, info[3]:{}".format(info, info[3]))
    except socket.gaierror:
        print("socket.gaierror error")

I then edited /etc/hosts and could see the results change based on what's in there.

There is also the call to _sort_hostnames, which does some sorting based on IPv6 or localhost names. That might cause you grief if you use IPv6 with an address that starts with fe00 or fe02.

I took a look at the man for getaddrinfo() which states:

If hints.ai_flags includes the AI_CANONNAME flag, then the ai_canonname field of the first of the addrinfo structures in the returned list is set to point to the official name of the host.

It seems that /etc/hosts is going to play a part in this grain for now. It is possible that the call to socket.getaddrinfo() is failing, and since there is only a pass in the except block, it fails silently. Maybe we should add a log message to the function?
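
A minimal sketch of what such a log message could look like, assuming the lookup is pulled out into a small helper. The helper name lookup_addrinfo, the logger name, and the injectable getaddrinfo parameter are hypothetical and not part of Salt; only the standard library socket and logging calls are real.

```python
import logging
import socket

log = logging.getLogger("fqdn_sketch")  # hypothetical logger name


def lookup_addrinfo(hostname, getaddrinfo=socket.getaddrinfo):
    """Return getaddrinfo() results, or [] after logging a resolver failure."""
    try:
        return getaddrinfo(
            hostname, 0, socket.AF_UNSPEC, socket.SOCK_STREAM,
            socket.SOL_TCP, socket.AI_CANONNAME
        )
    except socket.gaierror as exc:
        # Record the failure instead of swallowing it with a bare `pass`.
        log.warning("getaddrinfo(%r) failed: %s", hostname, exc)
        return []
```

With this shape, a resolver failure would at least leave a trace in the log, while the caller still gets an empty list and can fall back to another method.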

slinn0 commented 7 years ago

Could the following be your issue? Do we want to add only if canonname is set?

            if len(info) >= 4:
                l.append(info[3])

Might need to be:

            if len(info) >= 4 and info[3]:
                l.append(info[3])

Because sometimes info[3] is an empty string, and it gets added to the list of possible hostnames used for fqdn.

lottspot commented 7 years ago

We certainly don't disagree on the code path that sets the grain. In my OP, I linked to the exact same module you just directed me to. It's our understanding of the behavior of get_fqhostname where we seem to be diverging.

Firstly, the sorting method you're referring to is no longer used in current stable. Secondly, while the get_fqhostname function does make a call to socket.getaddrinfo, the return values from its call to socket.getaddrinfo will never be returned from get_fqhostname, because the return value of socket.getfqdn is always appended as element 0 in list l and the return value will always be either l's element 0 or None. Consequently, the entire code path which calls socket.getaddrinfo is essentially a dead path.

This could easily be fixed by simply moving the call to socket.getfqdn after the call to socket.getaddrinfo, like:

def get_fqhostname():
    '''
    Returns the fully qualified hostname
    '''
    l = []

    # try socket.getaddrinfo
    try:
        addrinfo = socket.getaddrinfo(
            socket.gethostname(), 0, socket.AF_UNSPEC, socket.SOCK_STREAM,
            socket.SOL_TCP, socket.AI_CANONNAME
        )
        for info in addrinfo:
            # info struct [family, socktype, proto, canonname, sockaddr]
            if len(info) >= 4:
                l.append(info[3])
    except socket.gaierror:
        pass

    l.append(socket.getfqdn())

    return l and l[0] or None

This would give preference to the canonname returned by socket.getaddrinfo and fall back on socket.getfqdn where a call to socket.getaddrinfo fails.
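
One caveat with the expression l and l[0] or None: if the first canonname is an empty string, l[0] is falsy and the function returns None even though socket.getfqdn() was appended. A hedged sketch of a variant that skips empty canonnames (the check suggested earlier in this thread) and only then falls back is shown below. The function name and the injectable parameters are hypothetical and exist only so the logic can be exercised without touching the real resolver.

```python
import socket


def fqhostname_preferring_canonname(
    getaddrinfo=socket.getaddrinfo,
    gethostname=socket.gethostname,
    getfqdn=socket.getfqdn,
):
    """Return the first non-empty canonname, else getfqdn(), else None."""
    try:
        addrinfo = getaddrinfo(
            gethostname(), 0, socket.AF_UNSPEC, socket.SOCK_STREAM,
            socket.SOL_TCP, socket.AI_CANONNAME
        )
    except socket.gaierror:
        addrinfo = []

    for info in addrinfo:
        # info is (family, socktype, proto, canonname, sockaddr);
        # canonname may be an empty string, so skip falsy values.
        if len(info) >= 4 and info[3]:
            return info[3]

    # Fall back only when getaddrinfo gave no usable canonname.
    return getfqdn() or None
```

Called with no arguments it uses the real socket functions; passing stand-ins for the three callables makes the ordering easy to check in isolation.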

I ran strace -f -e open python2.7, then ran import socket and typed socket.getfqdn(), which shows:

I stand corrected! Though your experiment shows that gai.conf apparently plays some role, the nature of that role is still painfully opaque, which goes back to my point about the entire behavior of socket.getfqdn being woefully unreliable, which in turn limits the usefulness of grains.fqdn.

The fact of the matter, though, is that socket.getfqdn internally calls socket.gethostbyaddr, which in turn wraps the gethostbyaddr C library function, which is itself documented as obsolete, with getaddrinfo recommended instead. Beyond the obsolescence, the fact that it fundamentally relies on gethostbyaddr means that the fully qualified name returned by socket.getfqdn is simply unreliable. An action as simple as adding IP addresses to a machine can cause its fqdn to change.
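
To make the dependence on gethostbyaddr concrete, here is a rough paraphrase of the name selection step as I understand it from CPython's socket.getfqdn: the primary name and aliases returned by gethostbyaddr are scanned in order, and the first one containing a dot wins. This is a sketch from memory rather than the actual standard library source, and pick_fqdn is a hypothetical name.

```python
def pick_fqdn(hostname, aliases):
    """Return the first name containing a dot, else the primary hostname."""
    for name in [hostname] + list(aliases):
        if "." in name:
            return name
    # No dotted name at all: the short name comes back unchanged.
    return hostname
```

Under that reading, whatever the reverse lookup reports first decides the result, so a change in address ordering or in /etc/hosts can change the returned name.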

lottspot commented 7 years ago

I'm certainly willing to write up a PR, but would rather know whether there's interest in changing this behavior before investing time in doing so.

Ch3LL commented 7 years ago

ping @cachedout can I get your input on this issue? Not sure which is the best route to ensure these edge cases are also taken care of. Thanks!

furlongm commented 7 years ago

I would definitely be interested in seeing a PR that changes this behaviour. I have been trying to figure out why FQDNs are broken in Salt on Linux. The grain does not return the same output as hostname -f but rather localhost.domainname, where domainname is correct but localhost should be the output of hostname -s. Removing the localhost lines from /etc/hosts resolves the issue; however, with Puppet, Chef and Ansible this is not needed, and the FQDN matches the output of hostname -f, as expected.

defanator commented 7 years ago

I just caught an issue where two different Salt versions on two hosts with the same OS (Ubuntu 14.04) and the same DNS/hosts/hostname setup provide different fqdn grains:

Expected one:

$ hostname -f
lb-a9723eb4.example.com

$ sudo salt-call grains.get fqdn
local:
    lb-a9723eb4.example.com

$ sudo salt-call --versions-report
Salt Version:
           Salt: 2016.3.4

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-48-generic
         system: Linux
        version: Ubuntu 14.04 trusty

Unexpected (broken) one:

$ hostname -f
frontend-025da143.example.com

$ sudo salt-call grains.get fqdn
local:
    frontend-025da143

$ sudo salt-call --versions-report
Salt Version:
           Salt: 2016.11.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: Not Installed
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-48-generic
         system: Linux
        version: Ubuntu 14.04 trusty

This is really weird and unreliable behavior. In our case it breaks mail delivery due to shortened hostnames/destinations in postfix configuration.

We're also observing a wrong fqdn on a number of other systems, including Ubuntu 16.04 running salt 2016.11.1, where hostname -f provides correct information.

I vote for an option that makes the fqdn grain match the output of hostname -f.

jdshewey commented 7 years ago

This appears to be an upstream bug. This grain calls Python's socket.getfqdn(), which in turn reportedly calls the C gethostbyname function which, in this case, will just return the value passed to it. If the value passed isn't the correct FQDN, you get the wrong answer: garbage in, garbage out. The parameter passed by Python is kernelhostname.

Several people have reported that putting the fqdn in your /etc/hosts file next to 127.0.0.1 resolves the issue. However, I found that I needed to re-set the kernel.hostname entry on my Red Hat based box with sysctl kernel.hostname=hostname.example.com, so it appears that kernelhostname is populated by systemctl on those systems and probably by /etc/hosts on Debian-based and other systems.

cachedout commented 7 years ago

@jdshewey has this exactly right. I've looked at this before and come to the same conclusion.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

hatifnatt commented 3 years ago

I think this can still be an issue: under certain conditions an unexpected value can be returned.

Salt Version:
           Salt: 2019.2.5

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: 2.0.0
      gitpython: 2.1.1
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.24.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: 1.3.7
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.13 (default, Sep 26 2018, 18:42:22)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: 2.0.1
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1

System Versions:
           dist: debian 9.12
         locale: UTF-8
        machine: x86_64
        release: 4.9.0-12-amd64
         system: Linux
        version: debian 9.12

Correct / expected output by hostname -f

$ hostname -f
myhostname.my.tld

Unexpected data in grains

$ salt-call grains.get fqdn
local:
    localhost

Cause of the issue:

1) socket.getfqdn() is used as the main method to get the FQDN: https://github.com/saltstack/salt/blob/master/salt/utils/network.py#L240 (permalink: https://github.com/saltstack/salt/blob/dc0595cc811f541044efe89446d5f212968db7e3/salt/utils/network.py#L240-L263)

2) An unexpected / incorrect line in /etc/hosts:

127.0.0.1       localhost
127.0.1.1       myhostname.my.tld       myhostname

# this line causes the problem
::1       myhostname
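
A small sketch of how such a line could be spotted automatically, assuming the problem pattern is a hosts entry whose first (canonical) name is the bare short hostname. The function name short_name_first_entries is hypothetical, and the parsing is deliberately simplistic (whitespace-separated fields, # comments stripped).

```python
def short_name_first_entries(hosts_text, short_name):
    """Return (address, names) for entries whose first name is the short name."""
    flagged = []
    for raw in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        # The first name on a hosts line is treated as the canonical one.
        if names and names[0] == short_name:
            flagged.append((address, names))
    return flagged
```

Run against the file above with short_name set to myhostname, this would flag only the ::1 entry, since the 127.0.1.1 entry lists the fully qualified name first.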

In the develop branch, socket.getaddrinfo() is used and the problem does not appear even with a not fully correct /etc/hosts: https://github.com/saltstack/salt/blob/develop/salt/utils/network.py#L199 (permalink: https://github.com/saltstack/salt/blob/637fe0b04f38b2274191b005d73b3c6707d7f400/salt/utils/network.py#L199-L223)

Basic Python test.

± python
Python 2.7.13 (default, Sep 26 2018, 18:42:22)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.getfqdn()
'localhost'
>>> socket.getaddrinfo(socket.gethostname(), 0, socket.AF_UNSPEC, socket.SOCK_STREAM, socket.SOL_TCP, socket.AI_CANONNAME)
[(10, 1, 6, 'myhostname.my.tld', ('::1', 0, 0, 0)), (2, 1, 6, '', ('127.0.1.1', 0))]

ashwin-subramanian commented 1 year ago

@hatifnatt Did you find a resolution for this? I am running into the same thing.

hatifnatt commented 1 year ago

@ashwin-subramanian I don't remember exactly how I fixed this issue; it was about 2 years ago... I probably corrected my /etc/hosts file. I have not encountered this problem since then.