unix1986 / parallel-ssh

Automatically exported from code.google.com/p/parallel-ssh

pssh broken with Python 2.4 #15

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. $ pssh -H localhost date

What is the expected output? What do you see instead?

I would expect to see:

[1] 07:38:11 [SUCCESS] localhost

Instead, I see:

Traceback (most recent call last):
  File "/usr/bin/pssh", line 5, in ?
    pkg_resources.run_script('pssh==2.1', 'pssh')
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1214, in run_script
    exec script_code in namespace, namespace
  File "/usr/bin/pssh", line 119, in ?

  File "/usr/bin/pssh", line 110, in do_pssh

  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 67, in run
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 203, in poll
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 96, in handle_sigchld
AttributeError: 'int' object has no attribute 'write'

What version of the product are you using? On what operating system?

pssh-2.1

$ python -V
Python 2.4.3

$ uname -a
Linux testhost 2.6.21-2952xen #1 SMP Tue Feb 12 09:11:36 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Please provide any additional information below.

I'm not sure if this is due to the x86_64 architecture or the version of 
Python that I have installed. I repeated the installation on OS X and the test 
above runs just fine. Using ssh instead of pssh on the same installation 
works fine as well.

Original issue reported on code.google.com by pemer...@gmail.com on 26 Feb 2010 at 3:48

GoogleCodeExporter commented 8 years ago
The traceback looks right to me. :)  Just kidding.  That definitely looks like a
problem.

It looks like it's a problem with Python <2.5 (not x86_64).  There's a feature
that psshlib uses that was introduced in Python 2.5, and the workaround I did for
Python 2.4 seems to be broken.  I think I have access to a machine with Python 2.4
somewhere, so I think I should be able to test it out there.

Thanks for the report.  I'll let you know when I have something to test.

Original comment by amcna...@gmail.com on 26 Feb 2010 at 6:27

GoogleCodeExporter commented 8 years ago
Okay.  I've made a commit that should fix this crash in Python 2.4.  Would you
mind testing to see if this works for you, too?  If it works, I will release a
version 2.1.1.  Let me know if you need instructions for cloning the Git
repository and testing.  Thanks for your help.

Original comment by amcna...@gmail.com on 26 Feb 2010 at 8:10

GoogleCodeExporter commented 8 years ago
Your fix is full of win. Thank you!

$ pssh -i -H localhost date
[1] 19:02:40 [SUCCESS] localhost
Sat Feb 27 19:02:40 UTC 2010

Pete

Original comment by pemer...@gmail.com on 27 Feb 2010 at 7:04

GoogleCodeExporter commented 8 years ago
Further issues, probably similar and probably not warranting a separate ticket,
but if you want me to break it out, I will.

When I try to run with more than one host, I see this on my Macbook:

$ pssh -i -H localhost -H localhost date
[1] 11:30:58 [SUCCESS] localhost
Sat Feb 27 11:30:58 PST 2010
[2] 11:30:58 [SUCCESS] localhost
Sat Feb 27 11:30:58 PST 2010

When I run on Python 2.4 (same system as above):

$ pssh -i -H localhost -H localhost date
Traceback (most recent call last):
  File "/usr/bin/pssh", line 5, in ?
    pkg_resources.run_script('pssh==2.1', 'pssh')
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1214, in run_script
    exec script_code in namespace, namespace
  File "/usr/bin/pssh", line 119, in ?

  File "/usr/bin/pssh", line 110, in do_pssh

  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 61, in run
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 113, in start_tasks
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 84, in start
  File "/usr/lib64/python2.4/subprocess.py", line 550, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.4/subprocess.py", line 988, in _execute_child
    data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
OSError: [Errno 4] Interrupted system call

Pete

Original comment by pemer...@gmail.com on 27 Feb 2010 at 7:33

GoogleCodeExporter commented 8 years ago
It looks like this is a bug in Python that was fixed today in Python 3.1 and 
2.6:

http://bugs.python.org/issue1068268

I wonder if there's any way we can work around this.
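
For illustration, the usual pattern for surviving "[Errno 4] Interrupted system call" in your own code is to retry the call when it fails with EINTR. The sketch below is a minimal, hypothetical helper (`retry_on_eintr` is not part of pssh or the standard library), and it is not the workaround that was eventually committed here. In this traceback the failing `os.read` lives inside `subprocess.py`, so it cannot be wrapped from the outside. The `except ... as` syntax also needs Python 2.6 or later, and Python 3.5 and later retry most interrupted system calls automatically.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying for as long as it fails with EINTR.

    Hypothetical helper for illustration only.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except (OSError, IOError) as exc:
            # EINTR means a signal handler (for example SIGCHLD) ran while
            # the system call was blocked; the call itself did not fail.
            if exc.errno != errno.EINTR:
                raise
```

A call such as `retry_on_eintr(os.read, fd, 4096)` then behaves like `os.read(fd, 4096)`, except that a signal arriving mid-read no longer surfaces as an exception.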

Original comment by amcna...@gmail.com on 1 Mar 2010 at 6:28

GoogleCodeExporter commented 8 years ago
I think I have a workaround for the problem described in comments 4 and 5.
pemerson, would you please do a git pull again and see if this works for you?
Thanks.

Original comment by amcna...@gmail.com on 1 Mar 2010 at 8:33

GoogleCodeExporter commented 8 years ago

Original comment by amcna...@gmail.com on 1 Mar 2010 at 8:39

GoogleCodeExporter commented 8 years ago
For me it looks like the first host succeeds, and then the second host is just
hanging.  When I control-c it, I get this:

Traceback (most recent call last):
  File "/usr/bin/pssh", line 5, in ?
    pkg_resources.run_script('pssh==2.1', 'pssh')
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1214, in run_script
    exec script_code in namespace, namespace
  File "/usr/bin/pssh", line 119, in ?

  File "/usr/bin/pssh", line 110, in do_pssh

  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 73, in run
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 174, in interrupted
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 111, in interrupted
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 99, in _kill
OSError: [Errno 3] No such process

Original comment by pemer...@gmail.com on 2 Mar 2010 at 4:57

GoogleCodeExporter commented 8 years ago
Issue 17 has been merged into this issue.

Original comment by amcna...@gmail.com on 2 Mar 2010 at 6:13

GoogleCodeExporter commented 8 years ago
pemerson, is this with commit "7c6d668" ("work around
http://bugs.python.org/issue1068268")?  I'm not getting any errors when I run the
command you posted in comment #4.  I'll keep on trying to reproduce it, but is
there anything you can think of that might make it easier for me to reproduce
this error?  Thanks.

Original comment by amcna...@gmail.com on 2 Mar 2010 at 6:20

GoogleCodeExporter commented 8 years ago
pemerson, I just pushed a commit that should stop the "OSError: [Errno 3] No such
process" error, but the real problem is that it was hanging to begin with.  I'm
still trying to reproduce this hang.
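
As a rough illustration of how a "No such process" error can be tolerated when cleaning up children, the sketch below ignores ESRCH from `os.kill` and re-raises everything else. The helper name `kill_if_running` is made up for this example, and the code is an assumption about the general technique, not a copy of the commit mentioned above.

```python
import errno
import os
import signal


def kill_if_running(pid, sig=signal.SIGKILL):
    """Send sig to pid; return False if the process no longer exists.

    Hypothetical helper for illustration only.
    """
    try:
        os.kill(pid, sig)
    except OSError as exc:
        # ESRCH: the child has already exited and been reaped, so there is
        # nothing left to kill. Any other error is still reported.
        if exc.errno == errno.ESRCH:
            return False
        raise
    return True
```

Swallowing ESRCH only hides the symptom seen on control-c; it does nothing about the hang itself.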

Original comment by amcna...@gmail.com on 2 Mar 2010 at 6:29

GoogleCodeExporter commented 8 years ago
This was a nasty problem, but I think I've finally fixed it.  Please do a "git
pull", which should get you commit fe8306c, and let me know if you still see
problems.

Thanks.

Original comment by amcna...@gmail.com on 2 Mar 2010 at 9:16

GoogleCodeExporter commented 8 years ago
Looks like it's working for me - thanks!

Can you maybe release this as a v2.1.1 when you get a chance?

Original comment by daro...@gmail.com on 2 Mar 2010 at 10:28

GoogleCodeExporter commented 8 years ago
I would love to release this as version 2.1.1, but I'm a little nervous about
doing it before we hear from pemerson.

Original comment by amcna...@gmail.com on 2 Mar 2010 at 10:34

GoogleCodeExporter commented 8 years ago
pemerson, have you had a chance to try out the fix from yesterday?  Thanks.

Original comment by amcna...@gmail.com on 3 Mar 2010 at 8:47

GoogleCodeExporter commented 8 years ago
So strange, I replied, but it looks like gmail ate the outbound email.

All good here!

I think 12 seconds is far too long for a parallel ssh to two nodes,
but that's probably for a separate thread.

Here's the output:

$ time pssh -i -H localhost -H localhost whoami
[1] 02:39:44 [SUCCESS] localhost
pete
[2] 02:39:45 [SUCCESS] localhost
pete

real    0m12.921s
user    0m10.676s
sys     0m1.402s

Original comment by pemer...@gmail.com on 4 Mar 2010 at 6:09

GoogleCodeExporter commented 8 years ago
pemerson, it might be related, so maybe it should still go in this bug report.
Unfortunately, I'm not having much luck reproducing it.  On my Python 2.4 system,
pssh does the parallel ssh to two nodes in 0.33 seconds on average.  Do you have
any other information that would help reproduce it?  If not, I could whip up a
custom commit with a bunch of print statements that might be able to give more
information.

I should probably go ahead and release pssh 2.1.1 now, to at least get it working
for people with Python 2.4, but let's keep on working on your problem in this
issue for now.

Original comment by amcna...@gmail.com on 4 Mar 2010 at 6:52

GoogleCodeExporter commented 8 years ago
Well, it's definitely in the script, as this works with all due speed:

$ cat mypssh 
#!/usr/bin/python

import os

os.system("ssh -A localhost whoami")
os.system("ssh -A localhost whoami")

$ time ./mypssh 
pete
pete

real    0m1.236s
user    0m0.014s
sys 0m0.021s

Other than that, I'm not sure how I can help, but I'd be glad to run a custom
pssh when you can add in some debugging / timing statements.

Pete

Original comment by pemer...@gmail.com on 4 Mar 2010 at 7:00

GoogleCodeExporter commented 8 years ago
I've released PSSH 2.1.1.  At least people with Python 2.4 shouldn't see crashes
anymore.

pemerson, I just pushed a branch called "issue15".  Would you please do a "git
pull; git checkout issue15" and give me the output?  The debugging info is a
little crude, but if it turns out to be helpful, I might leave it in and add a
"--debug" option or something.

Original comment by amcna...@gmail.com on 4 Mar 2010 at 7:40

GoogleCodeExporter commented 8 years ago
Did you git push issue15?

$ git clone git://aml.cs.byu.edu/pssh.git
Initialized empty Git repository in /home/pete/pssh/.git/
remote: Counting objects: 771, done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 771 (delta 540), reused 452 (delta 323)
Receiving objects: 100% (771/771), 198.62 KiB, done.
Resolving deltas: 100% (540/540), done.
$ cd pssh
$ git checkout issue15
error: pathspec 'issue15' did not match any file(s) known to git.

Original comment by pemer...@gmail.com on 4 Mar 2010 at 7:50

GoogleCodeExporter commented 8 years ago
Oops.  That should have been "git checkout origin/issue15".  Sorry for the
mistake.

Original comment by amcna...@gmail.com on 4 Mar 2010 at 7:55

GoogleCodeExporter commented 8 years ago
Ah, well, I'm still a git newb (but liking what I've seen so far)!

$ time pssh -i -H localhost -H localhost whoami
Thu Mar  4 20:04:32 2010 process starting
Thu Mar  4 20:04:38 2010 process started
Thu Mar  4 20:04:38 2010 process starting
Thu Mar  4 20:04:44 2010 process started
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 starting select
Thu Mar  4 20:04:44 2010 select finished
Thu Mar  4 20:04:44 2010 closing stderr
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 starting select
Thu Mar  4 20:04:44 2010 select finished
Thu Mar  4 20:04:44 2010 closing stdout
Thu Mar  4 20:04:44 2010 task finished
[1] 20:04:44 [SUCCESS] localhost
pete
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 task still running
Thu Mar  4 20:04:44 2010 starting select
Thu Mar  4 20:04:45 2010 select finished
Thu Mar  4 20:04:45 2010 task still running
Thu Mar  4 20:04:45 2010 starting select
Thu Mar  4 20:04:45 2010 select finished
Thu Mar  4 20:04:45 2010 closing stdout
Thu Mar  4 20:04:45 2010 task still running
Thu Mar  4 20:04:45 2010 starting select
Thu Mar  4 20:04:45 2010 select finished
Thu Mar  4 20:04:45 2010 closing stderr
Thu Mar  4 20:04:45 2010 task still running
Thu Mar  4 20:04:45 2010 starting select
Thu Mar  4 20:04:45 2010 handling sigchld
Thu Mar  4 20:04:45 2010 select interrupted
Thu Mar  4 20:04:45 2010 task finished
[2] 20:04:45 [SUCCESS] localhost
pete

real    0m13.008s
user    0m10.684s
sys 0m1.394s

Original comment by pemer...@gmail.com on 4 Mar 2010 at 8:06

GoogleCodeExporter commented 8 years ago
Fascinating.  I put a timestamp just before the Popen and just after the Popen on
a whim.  I really didn't think there was a chance that the Popen would actually
be hanging.  I have no idea why the Popen call would hang for 6 seconds.  Do you
have any ideas?
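
The kind of measurement described here can be approximated with a small wrapper that records the time on either side of the `subprocess.Popen` call. This is only a sketch: `timed_popen` is a hypothetical name, and the actual debugging code on the issue15 branch may look quite different.

```python
import subprocess
import sys
import time


def timed_popen(args):
    """Start a child process and report how long Popen itself took.

    Hypothetical helper for illustration only. The elapsed time covers
    process creation (fork/exec and descriptor cleanup), not the run time
    of the child.
    """
    start = time.time()
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    elapsed = time.time() - start
    return proc, elapsed


if __name__ == "__main__":
    # Launch a trivial child so that only the startup cost is measured.
    proc, elapsed = timed_popen([sys.executable, "-c", "pass"])
    proc.communicate()
    print("Popen took %.3f seconds" % elapsed)
```

If the printed time is in the range of seconds rather than milliseconds, the delay is in process creation rather than in ssh or the remote command.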

Original comment by amcna...@gmail.com on 4 Mar 2010 at 8:39

GoogleCodeExporter commented 8 years ago
This probably isn't relevant, but what do you get if you do this in the Python 
interactive interpreter:

os.sysconf("SC_OPEN_MAX")

Original comment by amcna...@gmail.com on 4 Mar 2010 at 9:40

GoogleCodeExporter commented 8 years ago
$ python
Python 2.4.3 (#1, Sep  3 2009, 15:37:37) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.sysconf("SC_OPEN_MAX")
1000000

Original comment by pemer...@gmail.com on 4 Mar 2010 at 9:44

GoogleCodeExporter commented 8 years ago
What did you do to your system? :)  On mine, SC_OPEN_MAX is 4096.

It looks like what's happening is it's taking forever to close all open file
descriptors.  In Python 2.6, they added os.closerange to make this more efficient
when the maximum file descriptor is really high.  To improve performance for
older versions of Python, we could set FD_CLOEXEC with fcntl on all of our file
descriptors.  For more information on the problem, see:

http://bugs.python.org/issue1663329

I'll try to see how bad it is to set FD_CLOEXEC as a long-term workaround.
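
A minimal sketch of setting FD_CLOEXEC with fcntl follows. The helper name `set_cloexec` is chosen for this example and is not necessarily what the pssh source uses. The assumption behind the technique is that once every descriptor pssh opens carries this flag, the kernel closes it automatically on exec, so the child does not need Python to loop over every possible descriptor number up to SC_OPEN_MAX.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd close-on-exec so it is not inherited by exec'd children.

    Hypothetical helper for illustration only.
    """
    # Read the current descriptor flags and add FD_CLOEXEC, leaving any
    # other flags untouched.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    set_cloexec(read_fd)
    set_cloexec(write_fd)
    print(bool(fcntl.fcntl(read_fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC))
```

On Python 3.4 and later, newly created descriptors are non-inheritable by default, so this step mainly matters on older interpreters such as the Python 2.4 discussed here.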

Original comment by amcna...@gmail.com on 4 Mar 2010 at 10:11

GoogleCodeExporter commented 8 years ago
Okay, try running the latest master (with the "set FD_CLOEXEC" commit), and see
if that goes more quickly.

Original comment by amcna...@gmail.com on 4 Mar 2010 at 10:30

GoogleCodeExporter commented 8 years ago
Oh, HUGE win. Well done!

$ time pssh -i -H localhost -H localhost whoami
[1] 23:41:41 [SUCCESS] localhost
pete
[2] 23:41:41 [SUCCESS] localhost
pete

real    0m0.895s
user    0m0.075s
sys 0m0.031s

Original comment by pemer...@gmail.com on 4 Mar 2010 at 11:43

GoogleCodeExporter commented 8 years ago
I'm glad I could make you happy. :)  So why does your system have such a high
maximum file descriptor number?

Anyway, this fix will show up in version 2.2, which I'm guessing is about a month
away.  One of the main holdups there is man pages; if you want 2.2 to happen more
quickly, feel free to help with issue #10. :)

Original comment by amcna...@gmail.com on 5 Mar 2010 at 3:57