juju / amulet

Testing harness and tools for Juju Charms

GCE: SSH connection timeout on upload_scripts #170

Closed johnsca closed 7 years ago

johnsca commented 7 years ago

This is consistently happening but only on GCE:

arosales@x230:~/tmp$ juju add-model hadoop -c hadoop-test
Added 'hadoop' model on google/us-east1 with credential 'cpc-testing' for user 'admin'
arosales@x230:~/tmp$ bundletester -t bundle:hadoop-processing  -e hadoop-test:hadoop --bundle bundle.yaml
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): api.jujucharms.com
[sudo] password for arosales: 
gpg: keyring `/tmp/tmplxx_m3nb/secring.gpg' created
gpg: keyring `/tmp/tmplxx_m3nb/pubring.gpg' created
gpg: requesting key C8068B11 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmplxx_m3nb/trustdb.gpg: trustdb created
gpg: key C8068B11: public key "Launchpad Ensemble PPA" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK
cs:~bigdata-dev/xenial/ganglia-5
    charm-proof                                                            PASS
    10-bundles-test.py                                                     FAIL

------------------------------------------------------------------------------
FAIL: cs:~bigdata-dev/xenial/ganglia-5::10-bundles-test.py
[/tmp/tmpA1wK0j/ganglia/tests/10-bundles-test.py exit 200]
2017-02-02 05:04:25 Starting deployment of hadoop-test:admin/hadoop
2017-02-02 05:04:26 Deploying applications...
2017-02-02 05:04:26  Deploying application metric-source using cs:xenial/ubuntu-10
2017-02-02 05:04:34 Config specifies num units for subordinate: ganglia-node
2017-02-02 05:07:22 Adding relations...
2017-02-02 05:07:22  Adding relation ganglia-node:juju-info <-> metric-source:juju-info
2017-02-02 05:08:23 Deployment complete in 237.63 seconds
ERROR exit status 1 (ssh: connect to host 35.185.21.178 port 22: Connection timed out
lost connection)
Unable to upload scripts

PASS: 1 FAIL: 1 SKIP:27 Total: 29 (380.76377 sec)
arosales@x230:~/tmp$ 

After the test fails and the model finishes coming up, ssh does work as expected, so it seems to be a timing issue with Amulet not waiting long enough.

arosales@x230:~/tmp$ juju ssh metric-source/0 'mkdir -p -m a=rwx /tmp/amulet'
Connection to 35.185.21.178 closed.
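For reference, a minimal sketch of the kind of wait this implies, i.e. polling for sshd before attempting the upload (the wait_for_ssh helper below is hypothetical, not Amulet's actual code):

import socket
import time

def wait_for_ssh(host, port=22, timeout=300, interval=5):
    """Poll until the host accepts TCP connections on the ssh port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # A successful TCP connect means sshd is at least listening.
            with socket.create_connection((host, port), timeout=10):
                return
        except OSError:
            time.sleep(interval)
    raise TimeoutError('ssh on {}:{} not reachable after {}s'.format(host, port, timeout))

# e.g. wait_for_ssh('35.185.21.178') before running the juju scp upload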

tvansteenburgh commented 7 years ago

  1. What does juju status look like at the time of the failure?
  2. Is resourcemanager/0 on 35.185.21.178?
  3. Why doesn't the output even show the resourcemanager application being deployed?
  4. It might be helpful to run bundletester with '-vl DEBUG' for more output.

johnsca commented 7 years ago

https://paste.ubuntu.com/23911596/

I used the wrong pastebin to show SSH working. I updated the ticket description.

johnsca commented 7 years ago

Note that it's a specific charm test that's failing, but the other nodes are left over from the bundle test.

arosales commented 7 years ago

Here is juju status, albeit well after the Amulet failure; it does show the resourcemanager IP, answering questions 1 and 2 from @tvansteenburgh.

arosales@x230:~/tmp$ juju status
Model   Controller   Cloud/Region     Version
hadoop  hadoop-test  google/us-east1  2.1-beta4

App                   Version  Status   Scale  Charm                   Store       Rev  OS      Notes
client                         active       1  hadoop-client           jujucharms    2  ubuntu  
ganglia               3.6.0    unknown      1  ganglia                 jujucharms    5  ubuntu  
ganglia-node          3.6.0    unknown      7  ganglia-node            jujucharms    6  ubuntu  
metric-source         16.04    active       1  ubuntu                  jujucharms   10  ubuntu  
namenode              2.7.1    active       1  hadoop-namenode         jujucharms    6  ubuntu  
plugin                2.7.1    active       1  hadoop-plugin           jujucharms    6  ubuntu  
resourcemanager       2.7.1    active       1  hadoop-resourcemanager  jujucharms    6  ubuntu  
rsyslog                        unknown      1  rsyslog                 jujucharms    7  ubuntu  
rsyslog-forwarder-ha           unknown      6  rsyslog-forwarder-ha    jujucharms    7  ubuntu  
slave                 2.7.1    active       3  hadoop-slave            jujucharms    6  ubuntu  

Unit                       Workload  Agent  Machine  Public address   Ports               Message
client/0*                  active    idle   4        35.185.42.220                        ready
  ganglia-node/3           unknown   idle            35.185.42.220                        
  plugin/0*                active    idle            35.185.42.220                        ready (hdfs & yarn)
  rsyslog-forwarder-ha/4   unknown   idle            35.185.42.220                        
ganglia/0*                 unknown   idle   5        104.196.151.64   80/tcp              
metric-source/0*           active    idle   6        35.185.21.178                        ready
  ganglia-node/6           unknown   idle            35.185.21.178                        
namenode/0*                active    idle   0        35.185.12.114    8020/tcp,50070/tcp  ready (3 datanodes)
  ganglia-node/1           unknown   idle            35.185.12.114                        
  rsyslog-forwarder-ha/2   unknown   idle            35.185.12.114                        
resourcemanager/0*         active    idle   0        35.185.12.114    8088/tcp,19888/tcp  ready (3 nodemanagers)
  ganglia-node/5           unknown   idle            35.185.12.114                        
  rsyslog-forwarder-ha/5   unknown   idle            35.185.12.114                        
rsyslog/0*                 unknown   idle   5        104.196.151.64   514/udp             
slave/0*                   active    idle   1        104.196.58.29    8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/2           unknown   idle            104.196.58.29                        
  rsyslog-forwarder-ha/1   unknown   idle            104.196.58.29                        
slave/1                    active    idle   2        104.196.211.58   8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/0*          unknown   idle            104.196.211.58                       
  rsyslog-forwarder-ha/0*  unknown   idle            104.196.211.58                       
slave/2                    active    idle   3        104.196.179.129  8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/4           unknown   idle            104.196.179.129                      
  rsyslog-forwarder-ha/3   unknown   idle            104.196.179.129                      

Machine  State    DNS              Inst id        Series  AZ
0        started  35.185.12.114    juju-69a136-0  xenial  us-east1-d
1        started  104.196.58.29    juju-69a136-1  xenial  us-east1-b
2        started  104.196.211.58   juju-69a136-2  xenial  us-east1-c
3        started  104.196.179.129  juju-69a136-3  xenial  us-east1-d
4        started  35.185.42.220    juju-69a136-4  xenial  us-east1-b
5        started  104.196.151.64   juju-69a136-5  xenial  us-east1-c
6        started  35.185.21.178    juju-69a136-6  xenial  us-east1-b

Relation         Provides              Consumes              Type
juju-info        client                ganglia-node          subordinate
hadoop-plugin    client                plugin                subordinate
juju-info        client                rsyslog-forwarder-ha  subordinate
node             ganglia               ganglia-node          regular
juju-info        ganglia-node          metric-source         regular
juju-info        ganglia-node          namenode              regular
juju-info        ganglia-node          resourcemanager       regular
juju-info        ganglia-node          slave                 regular
juju-info        metric-source         ganglia-node          subordinate
juju-info        namenode              ganglia-node          subordinate
namenode         namenode              plugin                regular
namenode         namenode              resourcemanager       regular
juju-info        namenode              rsyslog-forwarder-ha  subordinate
namenode         namenode              slave                 regular
resourcemanager  plugin                resourcemanager       regular
juju-info        resourcemanager       ganglia-node          subordinate
juju-info        resourcemanager       rsyslog-forwarder-ha  subordinate
resourcemanager  resourcemanager       slave                 regular
syslog           rsyslog               rsyslog-forwarder-ha  regular
juju-info        rsyslog-forwarder-ha  slave                 regular
juju-info        slave                 ganglia-node          subordinate
juju-info        slave                 rsyslog-forwarder-ha  subordinate

arosales@x230:~/tmp$ 
seman commented 7 years ago

The same issue shows up while testing the zeppelin charm, which might be easier and faster to debug.

johnsca commented 7 years ago

I'm not able to replicate this myself unless I remove the firewall rule enabling ssh access. That rule was not enabled by default, and it seems GCE may have changed how networking is configured at some point, which could explain why the rule was missing.

@seman Do you have access to the GCE account that you used to replicate this to check the network settings?

seman commented 7 years ago

@johnsca Yes, I do have access to the GCE account. There is a default-allow-ssh rule in the default firewall.

kwmonroe commented 7 years ago

This might also happen if Amulet tries to juju [scp|ssh] something after the current env has changed. In other words, maybe you only hit this when running against multiple clouds. It feels like we could use "-m" support when calling ssh or scp:

https://github.com/juju/amulet/blob/master/amulet/sentry.py#L254

--- sentry.py.og        2017-03-24 22:43:16.179250744 +0000
+++ sentry.py   2017-03-24 22:43:39.623251230 +0000
@@ -120,7 +120,7 @@
             # try one more time
             self.ssh(mkdir_cmd, raise_on_failure=True)

-        subprocess.check_call(['juju', 'scp'] +
+        subprocess.check_call(['juju', 'scp', '-m', os.environ.get('JUJU_ENV')] +
                               Path(source).files() +
                               ['{}:{}'.format(self.info['unit_name'], dest)])

@@ -251,7 +251,7 @@

         """
         unit = unit or self.info['unit_name']
-        cmd = ['juju', 'ssh', unit, '-v', command]
+        cmd = ['juju', 'ssh', '-m', os.environ.get('JUJU_ENV'), unit, '-v', command]
         p = subprocess.Popen(
             cmd,
             stdout=subprocess.PIPE,

You may also want this:

--- helpers.py.og       2017-03-24 22:46:12.887251004 +0000
+++ helpers.py  2017-03-24 22:46:42.115250972 +0000
@@ -189,7 +189,7 @@

 def default_environment():
-    return subprocess.check_output(['juju', 'switch']).strip().decode('utf8')
+    return os.environ.get('JUJU_ENV')

 class reify(object):

This obviously relies on JUJU_ENV (which bundletester sets for you) instead of juju switch to get the right env to operate in. It may also break people running juju 1 or Amulet directly (without bundletester).

If this seems sound, I can work on a PR, but it really should be cleaned up (as in, with a proper check for juju 2 and/or JUJU_ENV before altering these commands).
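A rough sketch of what that cleanup could look like, assuming we prefer JUJU_ENV and fall back to juju switch when it is unset (the current_model/model_args helpers are hypothetical, and the juju 1 '-e' case is ignored):

import os
import subprocess

def current_model():
    """Prefer JUJU_ENV (set by bundletester), else ask juju which model is current."""
    model = os.environ.get('JUJU_ENV')
    if model:
        return model
    return subprocess.check_output(['juju', 'switch']).strip().decode('utf8')

def model_args():
    """Extra arguments to pin juju ssh/scp to the model under test."""
    model = current_model()
    return ['-m', model] if model else []

# usage: subprocess.check_call(['juju', 'scp'] + model_args() + [source, '{}:{}'.format(unit, dest)])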

johnsca commented 7 years ago

Can you confirm or add an ssh firewall rule like the one in the following screenshot?

gce-firewall-rule

johnsca commented 7 years ago

@kwmonroe If the env was switched, it seems like the juju scp command would either fail with "invalid unit" or succeed (if it started before the env switch). That said, I do see three concurrent cwr jobs running against different clouds on the same worker right now, so who knows. Definitely worth fixing. However, does Juju 2.0 still honor JUJU_ENV? I believe that was a 1.x thing, though it might still be supported directly by Juju core.

seman commented 7 years ago

@johnsca The mentioned firewall rules are already set:

https://drive.google.com/a/canonical.com/file/d/0B4ESgSIXsBlsV29TaEg5N3dhemc/view?usp=sharing

johnsca commented 7 years ago

Damn. That really was my best guess as to why this is happening, but I can't reproduce it. I guess the next best thing is to add the explicit controller / model reference that @kwmonroe mentioned plus a 3x retry, then cross our fingers and hope it was just transient network issues or timing.

I should mention that I did confirm Amulet already has logic to wait for the Juju agent to report that it has started via juju status, so I can't see any reason why the ssh daemon and networking wouldn't be functional by that point; for the agent to report that status, it must have started and been able to communicate with the controller.
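For the retry part, a minimal sketch of the kind of wrapper meant here (the helper name, attempt count, and delay are illustrative, not necessarily what will land in Amulet):

import subprocess
import time

def check_call_with_retry(cmd, attempts=3, delay=30):
    """Run cmd, retrying a couple of times to ride out transient ssh/scp failures."""
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.check_call(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay)

# e.g. check_call_with_retry(['juju', 'scp', '-m', model, source, '{}:{}'.format(unit, dest)])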

johnsca commented 7 years ago

This has not been fixed. http://juju-cwr.s3-website-us-east-1.amazonaws.com/results-dryrun/zeppelin/bb517b1cd7574b0baef16d66863d9846/report.html

However, I cannot reproduce it, even with the same GCE credentials. Can anyone else still reproduce this outside of the CWR Jenkins environment?

arosales commented 7 years ago

@kwmonroe were you seeing this pretty consistently in your GCE cwr environment?

seman commented 7 years ago

This issue is due to juju ssh triggering sshguard to block the incoming ssh connections. You notice it after a few juju ssh sessions. We only see this on GCE because sshguard is part of the GCE default Ubuntu image but not of the images on the other public clouds.
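If anyone wants to double-check this on an affected unit, something like the sketch below should show the active block (it assumes sshguard's default iptables backend with an 'sshguard' chain on the xenial image, and uses --proxy so the check still works if the client's own IP is the one being blocked):

import subprocess

def sshguard_blocks(unit):
    """Return sshguard's current iptables block list from a unit.

    Assumes sshguard's default iptables backend with an 'sshguard' chain;
    --proxy routes through the controller rather than connecting directly.
    """
    return subprocess.check_output(
        ['juju', 'ssh', '--proxy', unit,
         'sudo iptables -n -L sshguard']).decode('utf8')

# e.g. print(sshguard_blocks('metric-source/0'))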

johnsca commented 7 years ago

Relevant bug on Juju core: https://bugs.launchpad.net/juju/+bug/1669501

@kwmonroe noted that there also seem to be some inconsistencies in which zone GCE units are placed, which might play a role in this (I could see split zones making it more likely that sshguard would get triggered).

kwmonroe commented 7 years ago

Yeah, before @johnsca pointed me at the core bug, I thought maybe there was an intra-region firewall issue. As you can see, my GCE Compute Engine console shows a single bundle deployment spanning 2 zones within us-west:

Name           Zone
juju-219ee9-0  us-west1-a
juju-b9540e-0  us-west1-a
juju-b9540e-1  us-west1-b
juju-b9540e-2  us-west1-a
juju-b9540e-3  us-west1-b
juju-b9540e-4  us-west1-a
juju-b9540e-5  us-west1-b

That top machine is my Juju controller in zone a. I see the "Unable to upload scripts" error on zone b machines. It may be a coincidence, or it may be that cross-zone ssh trips the guard more easily. I'll keep watching to see whether I hit any upload-scripts failures on machines in zone a.

kwmonroe commented 7 years ago

I'm pretty confident now that this is sshguard and not the zone difference. See the 4 [preauth] connection closures coming from my CI host, followed by sshguard blocking that IP:

Apr  5 14:31:14 juju-2f68d2-4 sshd[1070]: Connection closed by 162.213.34.190 port 40784 [preauth]
Apr  5 14:31:16 juju-2f68d2-4 sshd[1072]: Accepted publickey for ubuntu from 162.213.34.190 port 40788 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr  5 14:31:16 juju-2f68d2-4 sshd[1072]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:16 juju-2f68d2-4 systemd-logind[1345]: New session 2 of user ubuntu.
Apr  5 14:31:16 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:16 juju-2f68d2-4 sshd[1141]: Received disconnect from 162.213.34.190 port 40788:11: disconnected by user
Apr  5 14:31:16 juju-2f68d2-4 sshd[1141]: Disconnected from 162.213.34.190 port 40788
Apr  5 14:31:16 juju-2f68d2-4 sshd[1072]: pam_unix(sshd:session): session closed for user ubuntu
Apr  5 14:31:17 juju-2f68d2-4 systemd-logind[1345]: Removed session 2.
Apr  5 14:31:18 juju-2f68d2-4 sshd[1240]: Connection closed by 162.213.34.190 port 40802 [preauth]
Apr  5 14:31:20 juju-2f68d2-4 sshd[1246]: Accepted publickey for ubuntu from 162.213.34.190 port 40804 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr  5 14:31:20 juju-2f68d2-4 sshd[1246]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:20 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:20 juju-2f68d2-4 systemd-logind[1345]: New session 3 of user ubuntu.
Apr  5 14:31:22 juju-2f68d2-4 sshd[1325]: Received disconnect from 162.213.34.190 port 40804:11: disconnected by user
Apr  5 14:31:22 juju-2f68d2-4 sshd[1325]: Disconnected from 162.213.34.190 port 40804
Apr  5 14:31:22 juju-2f68d2-4 sshd[1246]: pam_unix(sshd:session): session closed for user ubuntu
Apr  5 14:31:22 juju-2f68d2-4 systemd-logind[1345]: Removed session 3.
Apr  5 14:31:22 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session closed for user ubuntu
Apr  5 14:31:24 juju-2f68d2-3 sshd[1309]: Connection closed by 162.213.34.190 port 60056 [preauth]
Apr  5 14:31:25 juju-2f68d2-3 sshd[1311]: Accepted publickey for ubuntu from 162.213.34.190 port 60060 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr  5 14:31:25 juju-2f68d2-3 sshd[1311]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:25 juju-2f68d2-3 systemd-logind[1579]: New session 2 of user ubuntu.
Apr  5 14:31:25 juju-2f68d2-3 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr  5 14:31:26 juju-2f68d2-3 sshd[1369]: Received disconnect from 162.213.34.190 port 60060:11: disconnected by user
Apr  5 14:31:26 juju-2f68d2-3 sshd[1369]: Disconnected from 162.213.34.190 port 60060
Apr  5 14:31:26 juju-2f68d2-3 sshd[1311]: pam_unix(sshd:session): session closed for user ubuntu
Apr  5 14:31:26 juju-2f68d2-3 systemd-logind[1579]: Removed session 2.
Apr  5 14:31:26 juju-2f68d2-3 systemd: pam_unix(systemd-user:session): session closed for user ubuntu
Apr  5 14:31:27 juju-2f68d2-3 sshd[1387]: Connection closed by 162.213.34.190 port 60066 [preauth]
Apr  5 14:31:27 juju-2f68d2-5 sshguard[1637]: Blocking 162.213.34.190:4 for >630secs: 40 danger in 4 attacks over 13 seconds (all: 40d in 1 abuses over 13s).

Shortly after, Amulet raises an infra failure with the dreaded "Unable to upload scripts".

As for solutions, I can think of a few:

I hate all of the above, so other suggestions are most welcome.
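For example, one workaround on the Amulet side would be to pace the juju ssh/scp calls so a quick burst of connections doesn't register as an attack; a rough sketch, with an interval that is a guess rather than tuned against sshguard's actual thresholds:

import time

_MIN_INTERVAL = 5.0   # seconds between consecutive juju ssh/scp calls (a guess)
_last_call = 0.0

def throttle_ssh():
    """Sleep just long enough that back-to-back ssh connections are spaced out."""
    global _last_call
    wait = _MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()

# call throttle_ssh() immediately before each juju ssh / juju scp invocation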

kwmonroe commented 7 years ago

Fixed in 2.2.0-xenial-amd64.