madhuneal / ppss

Automatically exported from code.google.com/p/ppss

Distributed processing: several issues #44

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. ppss deploy -C config.cfg (works; the nodes get the configs)
2. ppss start -C config.cfg (never starts processes on the nodes)
3. ppss erase -C config.cfg (works; erases the configs on the nodes)
4. ppss status -C config.cfg (all nodes show as UNKNOWN; DNS is fine and hosts files are used as well)

What is the expected output? What do you see instead?
Processes should start running on the nodes, and status should show the host names and the output being processed.

What version of the product are you using? On what operating system?
ppss 2.85

Please provide any additional information below.
Config files build, but this sometimes hangs, as all modes do from time to time.
DNS is fine. The deployment goes OK. Files and configs get to the nodes.

Processes never start on the nodes, and status does not recognize the hostnames.
There are a lot of problems with input and output directories within scripts.

Things like #ITEM always have the directories appended, e.g.
/ppss/ppss-home//ppss/ppss-home/ps (most would expect just a filename).

The ppss home directory is /home/ppss. I assume the working directory is then
/home/ppss/ppss-home.
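
As an illustration of the expectation above, here is a minimal bash sketch that reduces a doubled path like the one shown to a bare filename. The function name item_filename is hypothetical and is not part of ppss; the sketch only assumes that the value arrives as a path string.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ppss): print only the filename
# component of a path, using parameter expansion.
item_filename() {
    local path="$1"
    # Remove everything up to and including the last slash.
    printf '%s\n' "${path##*/}"
}

# Example with the doubled path from this report.
item_filename "/ppss/ppss-home//ppss/ppss-home/ps"
```

Parameter expansion is used here rather than an external command so the helper adds no extra process per item.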

Occasional mux errors and mkfifo errors while invoking start mode.

Original issue reported on code.google.com by r3su...@gmail.com on 25 Feb 2011 at 8:22

GoogleCodeExporter commented 9 years ago
I am confused about where the -o output directory is meant to be used. The 
script never seems to use it. For example, -c 'ps2pdf ' -o OUTPUT ignores -o 
and puts the output files in the current directory, and -c 'lame ' -o OUTPUT 
ignores -o and puts the files in the same directory as the one given by the 
-d (source files) option.
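
One possible way to sidestep the question, sketched here under assumptions, is to compute an explicit output path and hand it to the command directly. The function output_path and the file name track01.wav are hypothetical and are not part of ppss; the directories are the ones used elsewhere in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ppss): build an explicit output path
# from an input file, an output directory, and a new extension.
output_path() {
    local input="$1" outdir="$2" ext="$3"
    local name="${input##*/}"   # drop the directory part
    name="${name%.*}"           # drop the last extension, if any
    printf '%s/%s.%s\n' "${outdir%/}" "$name" "$ext"
}

# Example: where the mp3 for one (hypothetical) wav file would go.
output_path "/data/ppss/wav/track01.wav" "/data/ppss/OUTPUT" "mp3"
```

The result could then be passed as the second argument to lame, which accepts an input file followed by an output file.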

????????

In distributed mode the -o option seems to be ignored as well.

Original comment by r3su...@gmail.com on 25 Feb 2011 at 8:29

GoogleCodeExporter commented 9 years ago
The server is openSUSE 11.3 and the nodes are (1) openSUSE 11.3 and (2) CentOS 5 
machines. Bash is used.

I can work around some of the output directory issues in standalone mode, but 
distributed mode has more issues for me.

The most important question is why the nodes do not kick off and run any ppss 
processes.

Original comment by r3su...@gmail.com on 25 Feb 2011 at 8:34

GoogleCodeExporter commented 9 years ago
Here is the process from start to finish, following the docs. The data directory 
is /data/ppss (NFS). All nodes can ssh as ppss to the server, and the server can 
ssh as ppss to the ppss user on every node. No problems there.

Deployments are delivered, but possibly with ssh issues.

Normal ssh use on our network works fine with no errors.

ppss@chewey:~> ll
total 28
drwxr-xr-x 2 ppss users 4096 2011-02-24 18:48 bin
-rwxr-xr-x 1 ppss users  160 2011-02-25 16:26 build_config.sh
-rw-r--r-- 1 ppss users  288 2011-02-25 16:27 config.cfg
drwxr-xr-x 2 ppss users 4096 2011-02-24 18:48 Documents
-rw-r--r-- 1 ppss users    0 2011-02-25 16:27 known_hosts
-rw-r--r-- 1 ppss users   12 2011-02-25 16:12 nodes.txt
drwxr-xr-x 3 ppss users 4096 2011-02-25 15:29 public_html
-rw-r--r-- 1 ppss users   15 2011-02-25 15:49 status.txt
ppss@chewey:~> cat config.cfg
SRC_DIR=/data/ppss/wav
COMMAND='lame '
SSH_SERVER=192.168.1.3
USER=ppss
SSH_KEY=/home/ppss/.ssh/id_rsa
NODES_FILE=nodes.txt
REMOTE_OUTPUT_DIR=/data/ppss/OUTPUT
UPLOAD_TO_SERVER=1
DOWNLOAD_TO_NODE=1
PPSS_LOCAL_TMPDIR=ppss_dir/PPSS_LOCAL_TMPDIR
PPSS_LOCAL_OUTPUT=ppss_dir/PPSS_LOCAL_OUTPUT
ppss@chewey:~> df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/system-root
                       30G  8.0G   21G  29% /
udev                  314M  4.1M  310M   2% /dev
/dev/mapper/system-Backups
                       40G   18G   21G  47% /Backups
/dev/sda1              69M   34M   32M  52% /boot
/dev/mapper/system-home
                       40G  855M   37G   3% /home
/dev/mapper/system-data
                       20G  8.8G   10G  47% /data
192.168.1.2:/P        147G   84G   56G  61% /P
ppss@chewey:~> cat /etc/exports
/data   *.robot.com(rw) 192.168.1.7(rw)
ppss@chewey:~> ppss deploy -C config.cfg
Feb 25 16:29:20:
Feb 25 16:29:20:  =========================================================
Feb 25 16:29:20:                         |P|P|S|S|
Feb 25 16:29:20:  Distributed Parallel Processing Shell Script vers. 2.85
Feb 25 16:29:20:  =========================================================
Feb 25 16:29:20:  Hostname:             chewey
Feb 25 16:29:20:  ---------------------------------------------------------
Feb 25 16:29:21:  Deploying PPSS on nodes.
Feb 25 16:29:23:  PPSS installed on node 192.168.1.7.
Feb 25 16:29:23:  PPSS installed on node 192.168.1.3.
muxserver_listen bind(): No such file or directory
muxserver_listen bind(): No such file or directory
ppss@chewey:~> ppss start -C config.cfg
Feb 25 16:29:57:
Feb 25 16:29:57:  =========================================================
Feb 25 16:29:57:                         |P|P|S|S|
Feb 25 16:29:58:  Distributed Parallel Processing Shell Script vers. 2.85
Feb 25 16:29:58:  =========================================================
Feb 25 16:29:58:  Hostname:             chewey
Feb 25 16:29:58:  ---------------------------------------------------------
Feb 25 16:29:58:  Starting PPSS on node 192.168.1.7.
ppss@chewey:~> ppss status -C config.cfg
Feb 25 16:30:05:
Feb 25 16:30:05:  =========================================================
Feb 25 16:30:05:                         |P|P|S|S|
Feb 25 16:30:05:  Distributed Parallel Processing Shell Script vers. 2.85
Feb 25 16:30:05:  =========================================================
Feb 25 16:30:05:  Hostname:             chewey
Feb 25 16:30:05:  ---------------------------------------------------------
mkfifo: cannot create fifo `ppss_dir/ppss-fifo-29903-4282': No such file or directory
/usr/bin/ppss: line 1003: ppss_dir/ppss-fifo-29903-4282: No such file or directory
/usr/bin/ppss: line 1091: ppss_dir/status.txt: No such file or directory
Feb 25 16:30:06:  CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz
Feb 25 16:30:06:  Found 1 logic processors.
muxserver_listen bind(): No such file or directory
mkdir: cannot create directory `ppss-home/ppss_dir/PPSS_ITEM_LOCK_DIR': No such file or directory
Feb 25 16:30:09:  Status:               0 percent complete.
Feb 25 16:30:09:  Nodes:         1
Feb 25 16:30:09:  Items:                4
Feb 25 16:30:09:  ---------------------------------------------------------
Feb 25 16:30:09:  IP-address       Hostname            Processed     Status
Feb 25 16:30:09:  ---------------------------------------------------------
Feb 25 16:30:10:  192.168.1.7      UNKNOWN                     0    UNKNOWN
Feb 25 16:30:10:  ---------------------------------------------------------
Feb 25 16:30:10:  Total processed:                             0
ppss@chewey:~> ll
total 36
drwxr-xr-x 2 ppss users 4096 2011-02-24 18:48 bin
-rwxr-xr-x 1 ppss users  160 2011-02-25 16:26 build_config.sh
-rw-r--r-- 1 ppss users  288 2011-02-25 16:27 config.cfg
drwxr-xr-x 2 ppss users 4096 2011-02-24 18:48 Documents
-rw-r--r-- 1 ppss users    0 2011-02-25 16:27 known_hosts
-rw-r--r-- 1 ppss users   12 2011-02-25 16:12 nodes.txt
drwxr-xr-x 5 ppss users 4096 2011-02-25 16:30 ppss_dir
drwxr-xr-x 2 ppss users 4096 2011-02-25 16:29 ppss-home
drwxr-xr-x 3 ppss users 4096 2011-02-25 15:29 public_html
-rw-r--r-- 1 ppss users   15 2011-02-25 15:49 status.txt
ppss@chewey:~>                        
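
The mkfifo and mkdir errors in the status output above both name a relative path whose parent directory did not exist at that moment. Purely as an illustration of that failure mode, and not as a description of how ppss works internally, a guard like the following creates the parent directory before the FIFO. The function name make_fifo and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ppss): create a FIFO, making sure
# its parent directory exists first.
make_fifo() {
    local fifo="$1"
    local parent
    parent="$(dirname "$fifo")"
    mkdir -p "$parent" || return 1
    # Create the FIFO only if it is not already there.
    [ -p "$fifo" ] || mkfifo "$fifo"
}

# Example: the parent directory ppss_dir does not exist yet.
workdir="$(mktemp -d)"
make_fifo "$workdir/ppss_dir/example-fifo" && echo "created"
rm -rf "$workdir"
```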

Original comment by r3su...@gmail.com on 25 Feb 2011 at 10:36

GoogleCodeExporter commented 9 years ago
I believe ssh is not communicating the correct information between the 
server and the nodes. I could be wrong, of course, but here is what I see.

192.168.1.3 = server
192.168.1.7 = node

/data/ppss/wav is local to the server (192.168.1.3) and shared over NFS to the 
node (192.168.1.7).

If I deploy the following config:
./ppss -d /data/ppss/wav -c 'lame ' -m 192.168.1.3 -u ppss -k /home/ppss/.ssh/id_rsa -n nodes.txt

and then manually run the same command on the single node (a single node for 
testing purposes), I receive the following:

ppss@bluey:~/ppss-home> ./ppss -d /data/ppss/wav -c 'lame ' -m 192.168.1.3 -u ppss -k /home/ppss/.ssh/id_rsa -n nodes.txt
Feb 26 01:35:16:
Feb 26 01:35:16:  =========================================================
Feb 26 01:35:16:                         |P|P|S|S|
Feb 26 01:35:16:  Distributed Parallel Processing Shell Script vers. 2.85
Feb 26 01:35:16:  =========================================================
Feb 26 01:35:16:  Hostname:             bluey
Feb 26 01:35:16:  ---------------------------------------------------------
Feb 26 01:35:16:  CPU: Intel(R) Core(TM)2 Quad CPU    Q8400  @ 2.66GHz
Feb 26 01:35:16:  Found 4 logic processors.
Feb 26 01:35:33:  Starting 4 parallel workers.
Feb 26 01:35:33:  ---------------------------------------------------------
Feb 26 01:35:43:  Currently 25 percent complete. Processed 1 of 4.
Feb 26 01:35:44:  Total processing time (hh:mm:ss): 00:00:28
Feb 26 01:35:44:  Finished. Consult ppss_dir/job_log for job output.

However, the ssh log on the server shows that ssh authentication from the node succeeds.

Feb 26 01:35:06 chewey sshd[8500]: Received disconnect from 192.168.1.7: 11: disconnected by user
Feb 26 01:35:11 chewey sshd[8530]: Accepted publickey for ppss from 192.168.1.7 port 55970 ssh2
Feb 26 01:35:11 chewey sshd[8532]: Received disconnect from 192.168.1.7: 11: disconnected by user
Feb 26 01:35:12 chewey sshd[8557]: Accepted publickey for ppss from 192.168.1.7 port 55971 ssh2
Feb 26 01:35:12 chewey sshd[8559]: Received disconnect from 192.168.1.7: 11: disconnected by user
Feb 26 01:35:19 chewey sshd[8584]: Accepted publickey for ppss from 192.168.1.7 port 55972 ssh2
Feb 26 01:35:20 chewey sshd[8586]: Received disconnect from 192.168.1.7: 11: disconnected by user
Feb 26 01:35:22 chewey sshd[8611]: Accepted publickey for ppss from 192.168.1.7 port 55973 ssh2
Feb 26 01:35:22 chewey sshd[8613]: Received disconnect from 192.168.1.7: 11: disconnected by user
Feb 26 01:35:22 chewey sshd[8638]: Accepted publickey for ppss from 192.168.1.7 port 55974 ssh2
Feb 26 01:35:23 chewey sshd[8640]: Received disconnect from 192.168.1.7: 11: disconnected by user
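
To check from the server side that each node really authenticates, the accepted logins in a log like the one above can be tallied per source address. This is only a sketch: the function name count_accepted is hypothetical, and it assumes sshd lines in the format shown here, read from standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "Accepted publickey" entries per source
# address in sshd log text read from standard input.
count_accepted() {
    awk '/Accepted publickey for/ {
             # The source address is the field right after "from".
             for (i = 1; i < NF; i++)
                 if ($i == "from") { count[$(i + 1)]++; break }
         }
         END { for (ip in count) print ip, count[ip] }'
}

# Example with two lines in the same format as the log above.
count_accepted <<'EOF'
Feb 26 01:35:11 chewey sshd[8530]: Accepted publickey for ppss from 192.168.1.7 port 55970 ssh2
Feb 26 01:35:11 chewey sshd[8532]: Received disconnect from 192.168.1.7: 11: disconnected by user
EOF
```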

Furthermore, I suspect the ssh and nfs are ok since....

If I run the following on the node:
./ppss -d /data/ppss/wav -c 'lame '

Everything gets processed fine via NFS and the mp3 files are created, as if the 
node were in standalone mode. The job runs through 1 of 4, 2 of 4, etc. without 
problems. The mp3 files are created in /data/ppss/wav (the source directory 
where the input files are). Good enough.

ppss@bluey:~/ppss-home> ./ppss -d /data/ppss/wav -c 'lame '
Feb 26 01:42:57:
Feb 26 01:42:57:  =========================================================
Feb 26 01:42:57:                         |P|P|S|S|
Feb 26 01:42:57:  Distributed Parallel Processing Shell Script vers. 2.85
Feb 26 01:42:57:  =========================================================
Feb 26 01:42:57:  Hostname:             bluey
Feb 26 01:42:57:  ---------------------------------------------------------
Feb 26 01:42:57:  CPU: Intel(R) Core(TM)2 Quad CPU    Q8400  @ 2.66GHz
Feb 26 01:42:57:  Found 4 logic processors.
Feb 26 01:43:00:  Starting 4 parallel workers.
Feb 26 01:43:00:  ---------------------------------------------------------
Feb 26 01:43:54:  One job is remaining.
Feb 26 01:43:54:  Total processing time (hh:mm:ss): 00:00:57
Feb 26 01:43:54:  Finished. Consult ppss_dir/job_log for job output.

Original comment by r3su...@gmail.com on 26 Feb 2011 at 7:46

GoogleCodeExporter commented 9 years ago

Original comment by Louwrentius on 16 Mar 2011 at 11:12

GoogleCodeExporter commented 9 years ago
Additional info:

I am also running the server on a 64-bit machine and the clients on 32-bit 
machines, if that makes any difference.

Original comment by r3su...@gmail.com on 18 Mar 2011 at 7:58

GoogleCodeExporter commented 9 years ago
It is rather late but I'm working on a test setup to reproduce all these issues 
and I will see what I can do with it.

Original comment by Louwrentius on 9 Aug 2011 at 7:34

GoogleCodeExporter commented 9 years ago
I believe that this issue is fixed in 2.95.

Original comment by Louwrentius on 25 Dec 2011 at 4:34