Closed: johnsca closed this issue 7 years ago
What does juju status look like at the time of the failure?
https://paste.ubuntu.com/23911596/
I used the wrong pastebin to show SSH working. I updated the ticket description.
Note that it's a specific charm test that's failing, but the other nodes are left over from the bundle test.
Here is juju status, albeit well after the amulet failure, but it does show the resourcemanager IP, for questions 1 and 2 from @tvansteenburgh:
arosales@x230:~/tmp$ juju status
Model   Controller   Cloud/Region     Version
hadoop  hadoop-test  google/us-east1  2.1-beta4
App                   Version  Status   Scale  Charm                   Store       Rev  OS      Notes
client                         active       1  hadoop-client           jujucharms    2  ubuntu
ganglia               3.6.0    unknown      1  ganglia                 jujucharms    5  ubuntu
ganglia-node          3.6.0    unknown      7  ganglia-node            jujucharms    6  ubuntu
metric-source         16.04    active       1  ubuntu                  jujucharms   10  ubuntu
namenode              2.7.1    active       1  hadoop-namenode         jujucharms    6  ubuntu
plugin                2.7.1    active       1  hadoop-plugin           jujucharms    6  ubuntu
resourcemanager       2.7.1    active       1  hadoop-resourcemanager  jujucharms    6  ubuntu
rsyslog                        unknown      1  rsyslog                 jujucharms    7  ubuntu
rsyslog-forwarder-ha           unknown      6  rsyslog-forwarder-ha    jujucharms    7  ubuntu
slave                 2.7.1    active       3  hadoop-slave            jujucharms    6  ubuntu
Unit                      Workload  Agent  Machine  Public address   Ports               Message
client/0*                 active    idle   4        35.185.42.220                        ready
  ganglia-node/3          unknown   idle            35.185.42.220
  plugin/0*               active    idle            35.185.42.220                        ready (hdfs & yarn)
  rsyslog-forwarder-ha/4  unknown   idle            35.185.42.220
ganglia/0*                unknown   idle   5        104.196.151.64   80/tcp
metric-source/0*          active    idle   6        35.185.21.178                        ready
  ganglia-node/6          unknown   idle            35.185.21.178
namenode/0*               active    idle   0        35.185.12.114    8020/tcp,50070/tcp  ready (3 datanodes)
  ganglia-node/1          unknown   idle            35.185.12.114
  rsyslog-forwarder-ha/2  unknown   idle            35.185.12.114
resourcemanager/0*        active    idle   0        35.185.12.114    8088/tcp,19888/tcp  ready (3 nodemanagers)
  ganglia-node/5          unknown   idle            35.185.12.114
  rsyslog-forwarder-ha/5  unknown   idle            35.185.12.114
rsyslog/0*                unknown   idle   5        104.196.151.64   514/udp
slave/0*                  active    idle   1        104.196.58.29    8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/2          unknown   idle            104.196.58.29
  rsyslog-forwarder-ha/1  unknown   idle            104.196.58.29
slave/1                   active    idle   2        104.196.211.58   8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/0*         unknown   idle            104.196.211.58
  rsyslog-forwarder-ha/0* unknown   idle            104.196.211.58
slave/2                   active    idle   3        104.196.179.129  8042/tcp,50075/tcp  ready (datanode & nodemanager)
  ganglia-node/4          unknown   idle            104.196.179.129
  rsyslog-forwarder-ha/3  unknown   idle            104.196.179.129
Machine  State    DNS              Inst id        Series  AZ
0        started  35.185.12.114    juju-69a136-0  xenial  us-east1-d
1        started  104.196.58.29    juju-69a136-1  xenial  us-east1-b
2        started  104.196.211.58   juju-69a136-2  xenial  us-east1-c
3        started  104.196.179.129  juju-69a136-3  xenial  us-east1-d
4        started  35.185.42.220    juju-69a136-4  xenial  us-east1-b
5        started  104.196.151.64   juju-69a136-5  xenial  us-east1-c
6        started  35.185.21.178    juju-69a136-6  xenial  us-east1-b
Relation         Provides              Consumes              Type
juju-info        client                ganglia-node          subordinate
hadoop-plugin    client                plugin                subordinate
juju-info        client                rsyslog-forwarder-ha  subordinate
node             ganglia               ganglia-node          regular
juju-info        ganglia-node          metric-source         regular
juju-info        ganglia-node          namenode              regular
juju-info        ganglia-node          resourcemanager       regular
juju-info        ganglia-node          slave                 regular
juju-info        metric-source         ganglia-node          subordinate
juju-info        namenode              ganglia-node          subordinate
namenode         namenode              plugin                regular
namenode         namenode              resourcemanager       regular
juju-info        namenode              rsyslog-forwarder-ha  subordinate
namenode         namenode              slave                 regular
resourcemanager  plugin                resourcemanager       regular
juju-info        resourcemanager       ganglia-node          subordinate
juju-info        resourcemanager       rsyslog-forwarder-ha  subordinate
resourcemanager  resourcemanager       slave                 regular
syslog           rsyslog               rsyslog-forwarder-ha  regular
juju-info        rsyslog-forwarder-ha  slave                 regular
juju-info        slave                 ganglia-node          subordinate
juju-info        slave                 rsyslog-forwarder-ha  subordinate
arosales@x230:~/tmp$
The same issue shows up while testing the zeppelin charm, which might be easier and faster to debug.
I'm not able to replicate this myself unless I remove the firewall rule enabling ssh access. That rule was not enabled by default, and it seems GCE may have changed how networking is configured at some point, which could have caused that rule to be missing.
@seman Do you have access to the GCE account that you used to replicate this to check the network settings?
@johnsca Yes, I do have the GCE account. There is a default-allow-ssh rule in the default firewall.
This might also happen if amulet tries to juju [scp|ssh] some stuff when the env has changed. IOW, maybe you only hit this when running multiple clouds. Feels like we could use "-m" support when calling ssh or scp:
https://github.com/juju/amulet/blob/master/amulet/sentry.py#L254
--- sentry.py.og 2017-03-24 22:43:16.179250744 +0000
+++ sentry.py 2017-03-24 22:43:39.623251230 +0000
@@ -120,7 +120,7 @@
# try one more time
self.ssh(mkdir_cmd, raise_on_failure=True)
- subprocess.check_call(['juju', 'scp'] +
+ subprocess.check_call(['juju', 'scp', '-m', os.environ.get('JUJU_ENV')] +
Path(source).files() +
['{}:{}'.format(self.info['unit_name'], dest)])
@@ -251,7 +251,7 @@
"""
unit = unit or self.info['unit_name']
- cmd = ['juju', 'ssh', unit, '-v', command]
+ cmd = ['juju', 'ssh', '-m', os.environ.get('JUJU_ENV'), unit, '-v', command]
p = subprocess.Popen(
cmd,
stdout=subprocess.PIPE,
You may also want this:
--- helpers.py.og 2017-03-24 22:46:12.887251004 +0000
+++ helpers.py 2017-03-24 22:46:42.115250972 +0000
@@ -189,7 +189,7 @@
def default_environment():
- return subprocess.check_output(['juju', 'switch']).strip().decode('utf8')
+ return os.environ.get('JUJU_ENV')
class reify(object):
This obviously relies on JUJU_ENV (which bundletester gives you) vs juju switch to get the right env to operate in. This may also break people running juju-1 or amulet directly (no bundletester). If this seems sound, I can work on a PR, but it really should be cleaned up (as in, a proper check for juju2 and/or JUJU_ENV before altering these cmds).
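For illustration, a rough sketch of what that "proper check" could look like; the helper name _model_args is made up, and probing juju version is just one possible way to distinguish Juju 1 from Juju 2:

import os
import subprocess


def _juju_major_version():
    # `juju version` prints something like "2.1-beta4-xenial-amd64".
    out = subprocess.check_output(['juju', 'version']).decode('utf8').strip()
    return int(out.split('.')[0])


def _model_args():
    """Return ['-m', <model>] for Juju 2.x when a model can be determined."""
    if _juju_major_version() < 2:
        return []  # Juju 1.x: leave the commands alone (it uses -e / default env)
    model = os.environ.get('JUJU_ENV')  # set by bundletester
    if not model:
        # Fall back to the currently selected controller:model.
        model = subprocess.check_output(['juju', 'switch']).decode('utf8').strip()
    return ['-m', model] if model else []


# e.g. inside sentry.ssh():
# cmd = ['juju', 'ssh'] + _model_args() + [unit, '-v', command]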
Can you confirm or add an ssh firewall rule like the following:
@kwmonroe If the env was switched, it seems like the juju scp command would either fail with "invalid unit" or succeed (if started before the env switch). That said, I do see three concurrent cwr jobs running against different clouds on the same worker right now, so who knows. Definitely worth fixing. However, does Juju 2.0 no longer honor JUJU_ENV? I believe that was a 1.0 thing, but it might still be supported directly by Juju core.
@johnsca The mentioned firewall rules are already set: https://drive.google.com/a/canonical.com/file/d/0B4ESgSIXsBlsV29TaEg5N3dhemc/view?usp=sharing
Damn. That really was my best guess as to why this is happening, but I can't reproduce it. I guess the next best thing is to add the explicit controller/model reference that @kwmonroe mentioned and a 3x retry, then cross our fingers and hope that it was just due to transient network issues or timing.
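For what it's worth, a minimal sketch of what the 3x retry around the juju ssh/scp subprocess calls might look like; the helper name and retry/delay numbers are illustrative, not the actual amulet change:

import subprocess
import time


def check_call_with_retry(cmd, retries=3, delay=5):
    """Run cmd, retrying a few times to ride out transient ssh/network hiccups."""
    for attempt in range(1, retries + 1):
        try:
            return subprocess.check_call(cmd)
        except subprocess.CalledProcessError:
            if attempt == retries:
                raise
            time.sleep(delay)


# e.g. check_call_with_retry(['juju', 'scp', local_path, 'unit/0:/tmp/'])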
I should mention that I did confirm that Amulet already has logic in there to wait for the Juju agent to report that it's started via juju status, so I can't see any reason why the ssh daemon and networking would not be functional by that point. At that point, the agent must have started and been able to communicate with the controller in order to report that status.
This has not been fixed. http://juju-cwr.s3-website-us-east-1.amazonaws.com/results-dryrun/zeppelin/bb517b1cd7574b0baef16d66863d9846/report.html
However, I cannot reproduce it, even with the same GCE credentials. Can anyone else still reproduce this outside of the CWR Jenkins environment?
@kwmonroe were you seeing this pretty consistently in your GCE cwr environment?
This issue is due to juju ssh triggering sshguard to block the incoming ssh connections. You notice this issue after a few juju ssh sessions. We only see this on GCE because sshguard is part of the GCE default Ubuntu image but not of the images on the other public clouds.
Relevant bug on Juju core: https://bugs.launchpad.net/juju/+bug/1669501
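As a sanity check, something like the following could confirm the block on an affected unit without relying on ssh, since juju run goes through the unit agent rather than the ssh daemon; the unit name and log path here are just examples:

import subprocess

unit = 'slave/0'  # hypothetical affected unit
out = subprocess.check_output([
    'juju', 'run', '--unit', unit,
    'grep sshguard /var/log/auth.log | tail -n 5',
]).decode('utf8')
# Look for lines like "sshguard[...]: Blocking <ci-host-ip>:4 for >630secs".
print(out)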
@kwmonroe noted that there also seem to be some inconsistencies about which zone GCE units are placed in, which might play a role in this (I could see split zones making it more likely that sshguard gets triggered).
Yeah, before @johnsca pointed me at the core bug, I thought maybe there was an intra-region firewall issue. As you can see, my GCE compute engine console shows a single bundle deployment spanning 2 zones within us-west:
Name           Zone
juju-219ee9-0  us-west1-a
juju-b9540e-0  us-west1-a
juju-b9540e-1  us-west1-b
juju-b9540e-2  us-west1-a
juju-b9540e-3  us-west1-b
juju-b9540e-4  us-west1-a
juju-b9540e-5  us-west1-b
That top machine is my juju controller in zone a. I see the failure to upload scripts error on zone b machines. It may be a coincidence, or it may be that cross-zone ssh triggers the guard more easily. I'll keep watching to see if I see any upload scripts failures for machines in zone a.
I'm pretty confident now that this is sshguard and not the zone difference. See the 4 preauth bits coming from my CI host, followed by sshguard blocking that IP:
Apr 5 14:31:14 juju-2f68d2-4 sshd[1070]: Connection closed by 162.213.34.190 port 40784 [preauth]
Apr 5 14:31:16 juju-2f68d2-4 sshd[1072]: Accepted publickey for ubuntu from 162.213.34.190 port 40788 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr 5 14:31:16 juju-2f68d2-4 sshd[1072]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:16 juju-2f68d2-4 systemd-logind[1345]: New session 2 of user ubuntu.
Apr 5 14:31:16 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:16 juju-2f68d2-4 sshd[1141]: Received disconnect from 162.213.34.190 port 40788:11: disconnected by user
Apr 5 14:31:16 juju-2f68d2-4 sshd[1141]: Disconnected from 162.213.34.190 port 40788
Apr 5 14:31:16 juju-2f68d2-4 sshd[1072]: pam_unix(sshd:session): session closed for user ubuntu
Apr 5 14:31:17 juju-2f68d2-4 systemd-logind[1345]: Removed session 2.
Apr 5 14:31:18 juju-2f68d2-4 sshd[1240]: Connection closed by 162.213.34.190 port 40802 [preauth]
Apr 5 14:31:20 juju-2f68d2-4 sshd[1246]: Accepted publickey for ubuntu from 162.213.34.190 port 40804 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr 5 14:31:20 juju-2f68d2-4 sshd[1246]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:20 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:20 juju-2f68d2-4 systemd-logind[1345]: New session 3 of user ubuntu.
Apr 5 14:31:22 juju-2f68d2-4 sshd[1325]: Received disconnect from 162.213.34.190 port 40804:11: disconnected by user
Apr 5 14:31:22 juju-2f68d2-4 sshd[1325]: Disconnected from 162.213.34.190 port 40804
Apr 5 14:31:22 juju-2f68d2-4 sshd[1246]: pam_unix(sshd:session): session closed for user ubuntu
Apr 5 14:31:22 juju-2f68d2-4 systemd-logind[1345]: Removed session 3.
Apr 5 14:31:22 juju-2f68d2-4 systemd: pam_unix(systemd-user:session): session closed for user ubuntu
Apr 5 14:31:24 juju-2f68d2-3 sshd[1309]: Connection closed by 162.213.34.190 port 60056 [preauth]
Apr 5 14:31:25 juju-2f68d2-3 sshd[1311]: Accepted publickey for ubuntu from 162.213.34.190 port 60060 ssh2: RSA SHA256:ytUIDf82rBLBr3ENPDmVY55E5JiK1/L8+VxAefdcqYo
Apr 5 14:31:25 juju-2f68d2-3 sshd[1311]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:25 juju-2f68d2-3 systemd-logind[1579]: New session 2 of user ubuntu.
Apr 5 14:31:25 juju-2f68d2-3 systemd: pam_unix(systemd-user:session): session opened for user ubuntu by (uid=0)
Apr 5 14:31:26 juju-2f68d2-3 sshd[1369]: Received disconnect from 162.213.34.190 port 60060:11: disconnected by user
Apr 5 14:31:26 juju-2f68d2-3 sshd[1369]: Disconnected from 162.213.34.190 port 60060
Apr 5 14:31:26 juju-2f68d2-3 sshd[1311]: pam_unix(sshd:session): session closed for user ubuntu
Apr 5 14:31:26 juju-2f68d2-3 systemd-logind[1579]: Removed session 2.
Apr 5 14:31:26 juju-2f68d2-3 systemd: pam_unix(systemd-user:session): session closed for user ubuntu
Apr 5 14:31:27 juju-2f68d2-3 sshd[1387]: Connection closed by 162.213.34.190 port 60066 [preauth]
Apr 5 14:31:27 juju-2f68d2-5 sshguard[1637]: Blocking 162.213.34.190:4 for >630secs: 40 danger in 4 attacks over 13 seconds (all: 40d in 1 abuses over 13s).
Shortly after, amulet raises an infra failure with the dreaded unable to upload scripts.
As for solutions, I can think of a few:
1. get the [preauth] behavior addressed in juju core (per the bug linked above)
2. sentry.run('service sshguard stop') as soon as possible (still may not be soon enough; sketched below)
3. apt remove sshguard on bootstrap (or document how to do this for charm authors that are willing to run without it)
I hate all of the above, so other suggestions are most welcome.
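A rough sketch of what option 2 could look like in an amulet test; the 'ubuntu' charm is just a stand-in for whatever the test actually deploys, and as noted above this may still lose the race against the first few juju ssh/scp calls:

import amulet

d = amulet.Deployment(series='xenial')
d.add('ubuntu')  # stand-in application for this sketch
d.setup(timeout=900)
d.sentry.wait()

# Stop sshguard on every unit as early as possible, per the suggestion above.
for unit in d.sentry['ubuntu']:
    output, code = unit.run('service sshguard stop')
    if code != 0:
        amulet.raise_status(amulet.FAIL, msg='could not stop sshguard')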
Fixed in 2.2.0-xenial-amd64.
This is consistently happening but only on GCE:
After the test fails and the model finishes coming up, ssh does work as expected, so it seems to be a timing issue with Amulet not waiting long enough.