unix1986 / parallel-ssh

Automatically exported from code.google.com/p/parallel-ssh

enhancement: option to check for open and active SSH connections on all hosts #30

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
It would be great if parallel-ssh had an option that just connects and
authenticates to all the nodes, reports the status of each one as it does now,
and then exits with 0 iff every node was successfully connected and
authenticated to.

Thoughts?

Original issue reported on code.google.com by mdennis%...@gtempaccount.com on 17 Dec 2010 at 6:43

GoogleCodeExporter commented 8 years ago
I often do something like "pssh -h myhosts -t 5 echo hi" for this purpose.  I 
believe that this would meet the needs that you describe; is there anything 
that it's missing?  Let me know what you think.

Original comment by amcna...@gmail.com on 20 Dec 2010 at 9:18

GoogleCodeExporter commented 8 years ago
That doesn't really address my problem because if the machines are down, pssh 
still exits with 0 so the caller can't determine if all the machines are up.

Normally it makes sense for pssh et al to exit(0) even if some commands fail, 
but not always.

The more I think about it, the more I think all the tools need an extra
option, something like "--exit-one-on-failure", that when passed causes pssh
et al to exit(1) if any of the requests fail.

That would solve my immediate problem by allowing

pssh -h hosts -t 10 --exit-one-on-failure exit 0 || doFailureCode

Original comment by mdennis%...@gtempaccount.com on 21 Dec 2010 at 7:26

GoogleCodeExporter commented 8 years ago
Hmm.  Shouldn't pssh always exit with an error if there's a single failure?  I
had thought that this was already happening.  The current behavior sounds like
a bug to me; can you think of any particular reason that it should exit(0) even
if some commands fail?

Original comment by amcna...@gmail.com on 21 Dec 2010 at 8:34

GoogleCodeExporter commented 8 years ago
That certainly isn't happening right now.

My argument for it returning 0 is to be able to distinguish between pssh having
a problem and the remote servers and/or ssh having a problem.  This could also
be accomplished by using different return codes for each.  For example, 1 for
pssh failure (couldn't allocate memory, bad args, etc) and 2 for remote/ssh
failure (timeout, key rejected, connection refused, remote command exited with
non-zero return, etc).  This is similar to how grep et al work: grep exits 0 if
it matches anything, 1 if it matches nothing, and a status greater than 1 if an
error occurred.

I'm not against making it exit(somethingNotZero) by default if an ssh command
failed, but I figured exiting 0 was pretty explicit functionality to have in
there, so I assumed it was done on purpose.

Original comment by mdennis%...@gtempaccount.com on 22 Dec 2010 at 12:47

GoogleCodeExporter commented 8 years ago
I like the idea of having different error codes to discern between different 
problems.  Do you have any suggestions about what the error codes should mean?  
One possibility would be to return the number of hosts that failed, perhaps 
with a "-1" if it's some fatal early error (such as an invalid hosts file).  
Any thoughts?

Original comment by amcna...@gmail.com on 9 Jan 2011 at 6:57

GoogleCodeExporter commented 8 years ago
Negative return values can be a problem on most systems, as can values above
255, because only the low 8 bits of an exit status reach the parent.

As examples, try:

python -c 'import sys; sys.exit(-1)'; echo $?

and

python -c 'import sys; sys.exit(256)'; echo $?

http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html may be 
helpful to you here.
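
A minimal sketch of the same experiment in Python, assuming a POSIX system where the parent only sees the low 8 bits of the child's exit status. The helper names `wrapped_status` and `child_status` are made up for illustration. Under that assumption, -1 comes back as 255 and 256 comes back as 0.

```python
import subprocess
import sys


def wrapped_status(code):
    """Predict the status a POSIX parent observes: only the low 8 bits survive."""
    return code & 0xFF


def child_status(code):
    """Run a child Python that calls sys.exit(code); return what the parent sees."""
    result = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
    )
    return result.returncode


if __name__ == "__main__":
    for requested in (-1, 256, 300):
        # Columns: requested status, predicted status, observed status
        print(requested, wrapped_status(requested), child_status(requested))
```

This is why a raw count of failed hosts cannot be passed through the exit status once it reaches 256.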

Since a failure count above 255 can't be reported, I don't think that's a
workable solution: it would limit the use of pssh to fewer than 256 nodes,
which would be a real problem.

I would just do something simple like:

0: OK
1: pssh failure (couldn't execute a subprocess for one or more hosts for some reason)
2: ssh and/or remote failure of one or more hosts (subprocess was executed but returned non-zero)

Personally I don't think anything more is all that useful, as in most cases
there is nothing an automated caller could do to fix it and an interactive
caller can read the output.

Original comment by mdennis%...@gtempaccount.com on 9 Jan 2011 at 7:46

GoogleCodeExporter commented 8 years ago
Doesn't a return code of -1 turn into 255?  We could return the number of 
failed hosts up to 250 or something, with -1 being a pssh failure.
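
A rough sketch of that capped-count idea, assuming a cap of 250 and that an exit of -1 is observed as 255 by the caller. The names `count_status`, `MAX_REPORTED`, and `PSSH_FAILURE` are hypothetical and not taken from pssh.

```python
PSSH_FAILURE = 255  # how exit(-1) typically appears to a POSIX parent
MAX_REPORTED = 250  # hypothetical cap ("up to 250 or something")


def count_status(failed_hosts, internal_error=False):
    """Map a failed-host count to an exit status that fits in 8 bits."""
    if internal_error:
        return PSSH_FAILURE
    # Counts above the cap are all reported as the cap itself.
    return min(failed_hosts, MAX_REPORTED)
```

The trade-off is that a caller can no longer tell 250 failures from 2000; it only learns "at least 250".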

Or more in line with your proposal, it might make sense to have a different 
return code if all ssh commands fail than if only some of the ssh commands fail.

I suppose either of these would be better than what we're doing right now, but 
at the moment I don't have a strong preference.

Original comment by amcna...@gmail.com on 10 Jan 2011 at 2:24

GoogleCodeExporter commented 8 years ago
Hmm.  In addition to whether one or more processes failed, there is also the 
issue of whether a process returned a non-0 exit status.  I need to think about 
this a bit more, but I think there are several different values of exit status 
that we might want to provide.  Here's what I'm thinking right now:

0: all commands successful and returned 0
1: at least one remote command returned a non-0 value (but all commands ran)
2: at least one ssh command returned 255 (connection error, bad password, etc.)
3: at least one ssh process timed out or was killed by a signal
4: internal pssh error

Analogous exit statuses would be used for prsync, pscp, etc. (although some 
might not exit with a value of 1).  Any thoughts?  Is there anything else 
missing from this list?  I'll send an email to the mailing list to solicit 
additional input.

Original comment by amcna...@gmail.com on 18 Jan 2011 at 11:59

GoogleCodeExporter commented 8 years ago
The errors you mention are not necessarily mutually exclusive. Use a bitfield; 
that is, assign powers of two to them and add them up.
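
A minimal sketch of that encoding in Python, not pssh's actual code. The flag names and the bit assigned to each condition are assumptions made for illustration.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class Failure(IntFlag):
    """Hypothetical bit assignments, one power of two per condition."""
    NONE = 0
    REMOTE_NONZERO = 1     # a remote command returned non-zero
    SSH_ERROR = 2          # ssh itself returned 255
    KILLED_OR_TIMEOUT = 4  # process timed out or was killed by a signal
    INTERNAL = 8           # internal pssh error


def combined_status(per_host):
    """OR together every host's flags into a single exit status."""
    return int(reduce(or_, per_host, Failure.NONE))
```

The sketch uses OR rather than plain addition so that the same failure on two hosts does not carry into the next bit; addition gives the same result only when each flag is added at most once.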

Original comment by mark.d.k...@gmail.com on 19 Jan 2011 at 3:13

GoogleCodeExporter commented 8 years ago
Indeed they aren't mutually exclusive--my thought was to return the max (most 
severe).  The bitfield idea is clever, but I'm not sure if I've come across it 
in this context.  Is there any precedent for using bitfields for exit status 
codes?  I know that bash provides an arithmetic operator for bitwise AND, but 
overall it seems like there isn't much shell-level support for this.  What do 
you think?
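
For comparison, a sketch of the "return the max" approach, assuming the numbering from the proposed list of exit statuses also serves as the severity order. The names `Status` and `overall_status` are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Exit statuses from the proposed list, ordered by severity."""
    OK = 0
    REMOTE_NONZERO = 1     # a remote command returned non-zero
    SSH_ERROR = 2          # ssh returned 255
    KILLED_OR_TIMEOUT = 3  # ssh process timed out or was killed by a signal
    INTERNAL_ERROR = 4     # internal pssh error


def overall_status(per_host):
    """Return the most severe status seen across all hosts."""
    return int(max(per_host, default=Status.OK))
```

A caller can then branch with an ordinary numeric comparison, at the cost of losing the less severe conditions that occurred alongside the reported one.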

Original comment by amcna...@gmail.com on 19 Jan 2011 at 5:39

GoogleCodeExporter commented 8 years ago
I've looked into this, and so far I haven't been able to find any other 
programs that use bit fields for exit status.  Combined with the fact that the 
"test" command doesn't have any bitwise operators, I'm edging towards the 
scheme from comment #8, with the plan to make the semantics clear in the man 
page.

Original comment by amcna...@gmail.com on 19 Jan 2011 at 8:27

GoogleCodeExporter commented 8 years ago
Meaningful exit status codes were added to pssh in commit 4ef1fea.  The pssh 
man page includes documentation on the subject.  I still need to fix the other 
commands.  Please let me know if you see any problems or if you have any 
last-minute feedback.

Original comment by amcna...@gmail.com on 21 Jan 2011 at 10:30

GoogleCodeExporter commented 8 years ago
Okay, this is done for the others as well (although we still need to add man 
pages for these).  I'm going to mark this as closed, but please reopen it if 
you see any concrete or subjective problems with the implementation.  Thanks.

Original comment by amcna...@gmail.com on 21 Jan 2011 at 10:37

GoogleCodeExporter commented 8 years ago
Works in bash:

> bash -c 'bash -c "exit 5"; xit=$?
> if (( $xit & 1 )); then echo "1 bit set"; fi
> if (( $xit & 2 )); then echo "2 bit set"; fi
> if (( $xit & 4 )); then echo "4 bit set"; fi'
1 bit set
4 bit set

Works in tcsh:

> /bin/tcsh -c '
> /bin/tcsh -c "exit 5"
> set xit=$?
> if ( ( $xit & 1 ) != 0 ) then
>   echo "1 bit set"
> endif
> if ( ( $xit & 2 ) != 0 ) then
>   echo "2 bit set"
> endif
> if ( ( $xit & 4 ) != 0 ) then
>   echo "4 bit set"
> endif
> '
1 bit set
4 bit set

Generating the errors themselves does not require bitwise operators, just 
addition.
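
The same bit test, sketched in Python for callers that wrap the tools from a script. The helper name `set_bits` is made up for illustration.

```python
def set_bits(status):
    """Return the powers of two that are set in an 8-bit exit status."""
    return [1 << i for i in range(8) if status & (1 << i)]


if __name__ == "__main__":
    for bit in set_bits(5):
        print(f"{bit} bit set")
```

For a status of 5 this reports bits 1 and 4, matching the shell output above.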

Original comment by mark.d.k...@gmail.com on 24 Jan 2011 at 5:18