googlegenomics / elasticluster


Grid Engine commands emit: error: commlib error: access denied #12

Open mbookman opened 8 years ago

mbookman commented 8 years ago

I recently started seeing the following error when issuing commands like qstat after my cluster has been running for some time:

error: commlib error: access denied (client IP resolved to host name "frontend001".
This is not identical to clients host name "gridengine-frontend001.<project>.internal")
error: unable to send message to qmaster using port 6444 on host "frontend001":
got send error

I don't know yet why this started occurring, but I have traced it to the DHCP client on Compute Engine and an associated "set-hostname" hook.

$ sudo find / -name set-hostname
/usr/share/google/set-hostname
/etc/dhcp/dhclient-exit-hooks.d/set-hostname

These are actually the same file:

$ ls -l /etc/dhcp/dhclient-exit-hooks.d/set-hostname 
lrwxrwxrwx 1 root root 30 Feb 19 18:30 /etc/dhcp/dhclient-exit-hooks.d/set-hostname -> /usr/share/google/set-hostname

The DHCP client will episodically call the set-hostname script, which updates /etc/hosts with a fully qualified version of the hostname and then calls hostname to change the hostname.
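One way to check whether an instance is in this state is to list the names the hook has written into the hosts file and compare them with the output of hostname. This is only a rough sketch: google_added_names is a made-up helper name, and it assumes the record carries the "# Added by Google" marker.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Google or Grid Engine tooling).
# Prints every host name found on lines marked "# Added by Google"
# in the hosts file given as $1, one name per line.
google_added_names() {
    awk '/# Added by Google/ {
        sub(/#.*/, "")                  # drop the trailing comment marker
        for (i = 2; i <= NF; i++)       # field 1 is the IP address
            print $i
    }' "$1"
}

# Example (read-only, no root needed):
#   google_added_names /etc/hosts
#   hostname
```

If the first command prints a fully qualified name while Grid Engine was configured with the short one, that is the mismatch the commlib error complains about.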

I have worked around this (and tracked it specifically to the dhcp client) by editing the set-hostname script and adding:

echo $0 $* >> /tmp/set-hostname.txt
ps -p ${PPID} >> /tmp/set-hostname.txt
exit 0

before any code executes, namely before:

# Deal with a new hostname assignment.

This prevents the problem from occurring.

To "fix" a running instance:

  1. Remove the # Added by Google record from /etc/hosts
  2. Call sudo hostname frontend001
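The two steps above could be scripted roughly as follows. Treat it as a sketch under assumptions: strip_google_hosts_entry is a hypothetical helper name, the record is assumed to be identifiable by its "# Added by Google" marker, and frontend001 is just the short name from my cluster.

```shell
#!/bin/sh
# Hypothetical helper: removes every line containing the
# "# Added by Google" marker from the hosts file given as $1.
# A copy of the original is kept next to it as <file>.bak.
strip_google_hosts_entry() {
    hosts_file="$1"
    cp -p "$hosts_file" "$hosts_file.bak" || return 1
    # Write back through a redirect so the file keeps its inode and permissions.
    sed '/# Added by Google/d' "$hosts_file.bak" > "$hosts_file"
}

# Example, run as root on the affected node:
#   strip_google_hosts_entry /etc/hosts
#   hostname frontend001
```

Without the set-hostname workaround in place, the DHCP client hook may put the record back the next time it runs.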

I still need to find the right way to prevent this from happening. The /tmp/set-hostname.txt log shows:

$ cat /tmp/set-hostname.txt 
/sbin/dhclient-script 0
  PID TTY          TIME CMD
 1634 ?        00:00:00 dhclient

mdmiller53 commented 8 years ago

This doesn't quite match my environment. My error is the same:

error: commlib error: access denied (client IP resolved to host name "master001". This is not identical to clients host name "samtools-index-master001.c.isb-cgc.internal")
Unable to run job: unable to send message to qmaster using port 6444 on host "master001": got send error.

But my /etc/hostname file doesn't look like it has been modified; it contains only 'master001'. My fix, suggested by a website, was to create a file named /var/lib/gridengine/default/common/host_aliases and add the line:

master001 samtools-index-master001.c.isb-cgc.internal
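For anyone who wants to script this, here is a minimal sketch that appends such a line only if it is not already there. add_host_alias is a hypothetical helper name, and the path is the one from my install, so it may differ elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: appends "<short_name> <fqdn>" to a Grid Engine
# host_aliases file unless that exact line already exists.
#   $1 = path to host_aliases, $2 = short host name, $3 = fully qualified name
add_host_alias() {
    aliases_file="$1"
    entry="$2 $3"
    # -x: match the whole line, -F: fixed string, -q: no output
    if [ -f "$aliases_file" ] && grep -qxF -- "$entry" "$aliases_file"; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$aliases_file"
}

# Example, run as root:
#   add_host_alias /var/lib/gridengine/default/common/host_aliases \
#       master001 samtools-index-master001.c.isb-cgc.internal
```

The Grid Engine daemons may need a restart before they pick up the new alias.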

mbookman commented 8 years ago

Thanks Michael.

The file that is getting updated is /etc/hosts, not /etc/hostname.

The update to the host_aliases file works? That's fantastic.

mdmiller53 commented 8 years ago

Ah, yes, there it is indeed. Your fix might be more straightforward and general, although what I found should be good for Grid Engine. Here is my /etc/hosts:

# THIS FILE IS CONTROLLED BY ANSIBLE
# any local modifications will be overwritten!
#

# This file is managed by Ansible.
127.0.0.1 localhost.localdomain localhost

10.240.0.20 compute001
10.240.0.12 compute002
10.240.0.21 compute003
10.240.0.57 compute004
10.240.0.13 compute005
10.240.0.49 master001
10.240.0.49 samtools-index-master001.c.isb-cgc.internal samtools-index-master001  # Added by Google

pgrosu commented 8 years ago

Hi Matt,

If you have sudo access, it would probably be easier to just update the Ansible playbook.

~p