Closed GoogleCodeExporter closed 9 years ago
The traceback looks right to me. :) Just kidding. That definitely looks like a
problem.
It looks like it's a problem with Python <2.5 (not x86_64). There's a feature
that psshlib uses that was introduced in Python 2.5, and the workaround I did
for Python 2.4 seems to be broken. I think I have access to a machine with
Python 2.4 somewhere, so I think I should be able to test it out there.
Thanks for the report. I'll let you know when I have something to test.
Original comment by amcna...@gmail.com
on 26 Feb 2010 at 6:27
Okay. I've made a commit that should fix this crash in Python 2.4. Would you
mind testing to see if this works for you, too? If it works, I will release
version 2.1.1. Let me know if you need instructions for cloning the Git
repository and testing. Thanks for your help.
Original comment by amcna...@gmail.com
on 26 Feb 2010 at 8:10
Your fix is full of win. Thank you!
$ pssh -i -H localhost date
[1] 19:02:40 [SUCCESS] localhost
Sat Feb 27 19:02:40 UTC 2010
Pete
Original comment by pemer...@gmail.com
on 27 Feb 2010 at 7:04
Further issues, probably similar and probably not warranting a separate ticket,
but if you want me to break it out, I will.
When I try to run with more than one host, I see this on my Macbook:
$ pssh -i -H localhost -H localhost date
[1] 11:30:58 [SUCCESS] localhost
Sat Feb 27 11:30:58 PST 2010
[2] 11:30:58 [SUCCESS] localhost
Sat Feb 27 11:30:58 PST 2010
When I run on Python 2.4 (same system as above):
$ pssh -i -H localhost -H localhost date
Traceback (most recent call last):
  File "/usr/bin/pssh", line 5, in ?
    pkg_resources.run_script('pssh==2.1', 'pssh')
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1214, in run_script
    exec script_code in namespace, namespace
  File "/usr/bin/pssh", line 119, in ?
  File "/usr/bin/pssh", line 110, in do_pssh
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 61, in run
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 113, in start_tasks
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 84, in start
  File "/usr/lib64/python2.4/subprocess.py", line 550, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.4/subprocess.py", line 988, in _execute_child
    data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
OSError: [Errno 4] Interrupted system call
Pete
Original comment by pemer...@gmail.com
on 27 Feb 2010 at 7:33
It looks like this is a bug in Python that was fixed today in Python 3.1 and 2.6:
http://bugs.python.org/issue1068268
I wonder if there's any way we can work around this.
Original comment by amcna...@gmail.com
on 1 Mar 2010 at 6:28
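A common way to cope with "[Errno 4] Interrupted system call" in one's own code is to retry the interrupted call. The following is a minimal sketch of that general technique, not the change that went into pssh; `retry_on_eintr` is a hypothetical helper name, and the syntax is modern Python rather than Python 2.4.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func, retrying for as long as it fails with EINTR.

    Hypothetical helper for illustration. Any OSError other than
    EINTR is re-raised unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise
            # EINTR: a signal arrived mid-call, so simply try again.
```

A call such as `retry_on_eintr(os.read, fd, 1048576)` would use it. Note that a wrapper like this only protects calls made by the caller; the `os.read` in the traceback above is inside `subprocess.py` itself. On Python 3.5 and later the interpreter already retries most interrupted system calls (PEP 475).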
I think I have a workaround for the problem described in comments 4 and 5.
pemerson, would you please do a git pull again and see if this works for you?
Thanks.
Original comment by amcna...@gmail.com
on 1 Mar 2010 at 8:33
Original comment by amcna...@gmail.com
on 1 Mar 2010 at 8:39
For me it looks like the first host succeeds, and then the second host is just hanging.
When I control-c it, I get this:
Traceback (most recent call last):
  File "/usr/bin/pssh", line 5, in ?
    pkg_resources.run_script('pssh==2.1', 'pssh')
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 489, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.4/site-packages/setuptools-0.6c11-py2.4.egg/pkg_resources.py", line 1214, in run_script
    exec script_code in namespace, namespace
  File "/usr/bin/pssh", line 119, in ?
  File "/usr/bin/pssh", line 110, in do_pssh
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 73, in run
  File "build/bdist.linux-x86_64/egg/psshlib/manager.py", line 174, in interrupted
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 111, in interrupted
  File "build/bdist.linux-x86_64/egg/psshlib/task.py", line 99, in _kill
OSError: [Errno 3] No such process
Original comment by pemer...@gmail.com
on 2 Mar 2010 at 4:57
Issue 17 has been merged into this issue.
Original comment by amcna...@gmail.com
on 2 Mar 2010 at 6:13
pemerson, is this with commit "7c6d668" ("work around
http://bugs.python.org/issue1068268")?
I'll keep on looking at it, but I'm not getting any errors when I run the
command you posted in comment #4. Is there anything you can think of that
might make it easier for me to reproduce this error? Thanks.
Original comment by amcna...@gmail.com
on 2 Mar 2010 at 6:20
pemerson, I just pushed a commit that should stop the "OSError: [Errno 3] No
such process" error, but the real problem is that it was hanging to begin with.
I'm still trying to reproduce this hang.
Original comment by amcna...@gmail.com
on 2 Mar 2010 at 6:29
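"[Errno 3] No such process" is `errno.ESRCH`, which `os.kill` raises when no process with the given pid exists, for example because the child has already exited and been reaped. A minimal sketch of tolerating that case follows; `kill_if_alive` is a hypothetical name, and this is not necessarily how the pssh commit handles it.

```python
import errno
import os
import signal


def kill_if_alive(pid, sig=signal.SIGTERM):
    """Send sig to pid; return False instead of raising if pid is gone."""
    try:
        os.kill(pid, sig)
    except OSError as e:
        if e.errno != errno.ESRCH:
            raise  # anything else (e.g. EPERM) is a real problem
        return False
    return True
```

Swallowing only ESRCH keeps permission errors and other unexpected failures visible.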
This was a nasty problem, but I think I've finally fixed it. Please do a
"git pull", which should get you commit fe8306c, and let me know if you still
see problems. Thanks.
Original comment by amcna...@gmail.com
on 2 Mar 2010 at 9:16
Looks like it's working for me - thanks!
Can you maybe release this as a v2.1.1 when you get a chance?
Original comment by daro...@gmail.com
on 2 Mar 2010 at 10:28
I would love to release this as version 2.1.1, but I'm a little nervous about
doing it before we hear from pemerson.
Original comment by amcna...@gmail.com
on 2 Mar 2010 at 10:34
pemerson, have you had a chance to try out the fix from yesterday? Thanks.
Original comment by amcna...@gmail.com
on 3 Mar 2010 at 8:47
So strange, I replied, but it looks like gmail ate the outbound email.
All good here!
I think 12 seconds is far too long for a parallel ssh to two nodes,
but that's probably for a separate thread.
Here's the output:
$ time pssh -i -H localhost -H localhost whoami
[1] 02:39:44 [SUCCESS] localhost
pete
[2] 02:39:45 [SUCCESS] localhost
pete
real 0m12.921s
user 0m10.676s
sys 0m1.402s
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 6:09
pemerson, it might be related, so maybe it should still go in this bug report.
Unfortunately, I'm not having much luck reproducing it. On my Python 2.4
system, pssh does the parallel ssh to two nodes in 0.33 seconds on average. Do
you have any other information that would help reproduce it? If not, I could
whip up a custom commit with a bunch of print statements that might be able to
give more information.
I should probably go ahead and release pssh 2.1.1 now, to at least get it
working for people with Python 2.4, but let's keep on working on your problem
in this issue for now.
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 6:52
Well, it's definitely in the script, as this works with all due speed:
$ cat mypssh
#!/usr/bin/python
import os
os.system("ssh -A localhost whoami")
os.system("ssh -A localhost whoami")
$ time ./mypssh
pete
pete
real 0m1.236s
user 0m0.014s
sys 0m0.021s
Other than that, I'm not sure how I can help, but I'd be glad to run a custom
pssh when you can add in some debugging / timing statements.
Pete
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 7:00
I've released PSSH 2.1.1. At least people with Python 2.4 shouldn't see
crashes anymore.
pemerson, I just pushed a branch called "issue15". Would you please do a
"git pull; git checkout issue15" and give me the output? The debugging info is
a little crude, but if it turns out to be helpful, I might leave it in and add
a "--debug" option or something.
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 7:40
Did you git push issue15?
$ git clone git://aml.cs.byu.edu/pssh.git
Initialized empty Git repository in /home/pete/pssh/.git/
remote: Counting objects: 771, done.
remote: Compressing objects: 100% (423/423), done.
remote: Total 771 (delta 540), reused 452 (delta 323)
Receiving objects: 100% (771/771), 198.62 KiB, done.
Resolving deltas: 100% (540/540), done.
$ cd pssh
$ git checkout issue15
error: pathspec 'issue15' did not match any file(s) known to git.
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 7:50
Oops. That should have been "git checkout origin/issue15". Sorry for the mistake.
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 7:55
Ah, well, I'm still a git newb (but liking what I've seen so far)!
$ time pssh -i -H localhost -H localhost whoami
Thu Mar 4 20:04:32 2010 process starting
Thu Mar 4 20:04:38 2010 process started
Thu Mar 4 20:04:38 2010 process starting
Thu Mar 4 20:04:44 2010 process started
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 starting select
Thu Mar 4 20:04:44 2010 select finished
Thu Mar 4 20:04:44 2010 closing stderr
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 starting select
Thu Mar 4 20:04:44 2010 select finished
Thu Mar 4 20:04:44 2010 closing stdout
Thu Mar 4 20:04:44 2010 task finished
[1] 20:04:44 [SUCCESS] localhost
pete
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 task still running
Thu Mar 4 20:04:44 2010 starting select
Thu Mar 4 20:04:45 2010 select finished
Thu Mar 4 20:04:45 2010 task still running
Thu Mar 4 20:04:45 2010 starting select
Thu Mar 4 20:04:45 2010 select finished
Thu Mar 4 20:04:45 2010 closing stdout
Thu Mar 4 20:04:45 2010 task still running
Thu Mar 4 20:04:45 2010 starting select
Thu Mar 4 20:04:45 2010 select finished
Thu Mar 4 20:04:45 2010 closing stderr
Thu Mar 4 20:04:45 2010 task still running
Thu Mar 4 20:04:45 2010 starting select
Thu Mar 4 20:04:45 2010 handling sigchld
Thu Mar 4 20:04:45 2010 select interrupted
Thu Mar 4 20:04:45 2010 task finished
[2] 20:04:45 [SUCCESS] localhost
pete
real 0m13.008s
user 0m10.684s
sys 0m1.394s
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 8:06
Fascinating. I put a timestamp just before the Popen and just after the Popen
on a whim. I really didn't think there was a chance that the Popen would
actually be hanging. I have no idea why the Popen call would hang for 6
seconds. Do you have any ideas?
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 8:39
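The measurement described here amounts to bracketing the `Popen` call with timestamps. A minimal sketch follows; `timed_popen` is a hypothetical helper, it uses `time.monotonic` (Python 3.3+), and it does not reproduce the actual debugging code on the issue15 branch.

```python
import subprocess
import sys
import time


def timed_popen(cmd, **popen_kwargs):
    """Start cmd and return (process, seconds spent inside Popen())."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd, **popen_kwargs)
    elapsed = time.monotonic() - start
    return proc, elapsed


if __name__ == "__main__":
    # Launch a trivial child so only process start-up cost is measured.
    proc, elapsed = timed_popen([sys.executable, "-c", "pass"])
    proc.wait()
    print("Popen took %.3f seconds" % elapsed)
```

Because `Popen` returns as soon as the child has been started, the elapsed time covers only process creation, not the child's run time.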
This probably isn't relevant, but what do you get if you do this in the Python
interactive interpreter:
os.sysconf("SC_OPEN_MAX")
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 9:40
$ python
Python 2.4.3 (#1, Sep 3 2009, 15:37:37)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.sysconf("SC_OPEN_MAX")
1000000
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 9:44
What did you do to your system? :) On mine, SC_OPEN_MAX is 4096.
It looks like what's happening is it's taking forever to close all open file
descriptors. In Python 2.6, they added os.closerange to make this more
efficient when the maximum file descriptor is really high. To improve
performance for older versions of Python, we could set FD_CLOEXEC with fcntl
on all of our file descriptors. For more information on the problem, see:
http://bugs.python.org/issue1663329
I'll try to see how bad it is to set FD_CLOEXEC as a long-term workaround.
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 10:11
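Setting FD_CLOEXEC marks a descriptor to be closed automatically when the process calls exec, so a child does not inherit it even if the parent never loops over every possible descriptor number. A minimal sketch using `fcntl` (POSIX only) follows; `set_cloexec` is a hypothetical helper name, and this is not necessarily the code in the pssh commit.

```python
import fcntl


def set_cloexec(fd):
    """Set the close-on-exec flag on fd, preserving its other fd flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
```

Reading the current flags first and OR-ing in `FD_CLOEXEC` avoids clobbering any other descriptor flags. On Python 3.4 and later, descriptors created by Python are non-inheritable by default (PEP 446), so a helper like this mainly matters on older versions.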
Okay, try running the latest master (with the "set FD_CLOEXEC" commit), and see
if that goes more quickly.
Original comment by amcna...@gmail.com
on 4 Mar 2010 at 10:30
Oh, HUGE win. Well done!
$ time pssh -i -H localhost -H localhost whoami
[1] 23:41:41 [SUCCESS] localhost
pete
[2] 23:41:41 [SUCCESS] localhost
pete
real 0m0.895s
user 0m0.075s
sys 0m0.031s
Original comment by pemer...@gmail.com
on 4 Mar 2010 at 11:43
I'm glad I could make you happy. :) So why does your system have such a high
maximum file descriptor number?
Anyway, this fix will show up in version 2.2, which I'm guessing is about a
month away. One of the main holdups there is man pages; if you want 2.2 to
happen more quickly, feel free to help with issue #10. :)
Original comment by amcna...@gmail.com
on 5 Mar 2010 at 3:57
Original issue reported on code.google.com by
pemer...@gmail.com
on 26 Feb 2010 at 3:48