ojwoodford / batch_job

Parallelize MATLAB for loops across workers, without the Parallel Computing Toolbox
MIT License
18 stars, 6 forks

platform compatibility: Windows -> "cat" not available #11

Open spotlightgit opened 4 years ago

spotlightgit commented 4 years ago

Hello Oliver,

My SSH connection between both PCs is working, meaning I can ssh without entering a password, which is one prerequisite for distributed computing with your wonderful toolbox. Unfortunately, the "cat" command is not known to the Windows command prompt, which is what MATLAB's "system" command invokes.

Possible workarounds:

1.) "cat" is available in Windows PowerShell, so calling PowerShell instead of the command prompt looks promising. However, neither !powershell cat … nor !powershell -inputformat none cat ... works in my MATLAB (I don't know why).
2.) Replace "cat" with "type", which seems to do the same thing as "cat" on Linux systems. "type" works in both the command prompt and PowerShell.

To be compatible with both Linux and Windows, the code could check "ispc" and then execute either system(sprintf('type … or system(sprintf('cat … Alternatively, instead of using "ispc", an additional option for "batch_job_distrib()" would also be possible.
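A minimal sketch of the "ispc" idea, for illustration only. The file name and host name are placeholders, and the command string is only built and displayed, not executed. Note that the remote end of the pipe still calls "cat", so swapping in "type" only addresses the master side.

```matlab
% Sketch only: choose the local file-dump command based on the master's OS.
% cmd_file and worker_host are placeholder values for illustration.
cmd_file = 'batch_job_distrib_cmd.bat';
worker_host = 'worker1';   % hypothetical host name

if ispc
    dump_cmd = 'type';     % Windows command prompt counterpart of cat
else
    dump_cmd = 'cat';
end

% Build (but do not run) the copy command. The remote side still uses
% "cat", so this only helps when the worker is a Unix-like system.
copy_cmd = sprintf('%s %s | ssh %s "cat - > ./batch_job_distrib_cmd.bat"', ...
                   dump_cmd, cmd_file, worker_host);
disp(copy_cmd);
```

Because "ispc" reports the operating system of the machine running MATLAB, this choice reflects the master only.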

What do you think?

spotlightgit commented 4 years ago

Maybe it works with the following changes. In start_workers.m:

% Copy the command file
if ispc
   [status, cmdout] = system(sprintf('scp %s %s:./batch_job_distrib_cmd.bat', cmd_file, workers{w,1}));
else
   [status, cmdout] = system(sprintf('cat %s | ssh %s "cat - > ./batch_job_distrib_cmd.bat"', cmd_file, workers{w,1}));
end

and

% Make it executable
if ~ispc
   [status, cmdout] = system(sprintf('ssh %s "chmod u+x batch_job_distrib_cmd.bat"', workers{w,1}));
   assert(status == 0, cmdout);
end

and

% Add on the ssh command
if ispc
   cmd = sprintf('ssh %s batch_job_distrib_cmd.bat', workers{w,1});
else
   cmd = sprintf('ssh %s ./batch_job_distrib_cmd.bat', workers{w,1});
end

In batch_job_distrib.m:

% Remove the command file
try
   if ispc
      [status, cmdout] = system(sprintf('ssh %s "del batch_job_distrib_cmd.bat"', workers{w,1}));
   else
      [status, cmdout] = system(sprintf('ssh %s "rm -f ./batch_job_distrib_cmd.bat"', workers{w,1}));
   end
   assert(status == 0, cmdout);
catch me

ojwoodford commented 4 years ago

Many thanks for the input here. One issue I see is that ispc() tells you whether the master is a PC, but not the worker. The master could be a PC and the worker could be running Linux.

spotlightgit commented 4 years ago

Well, I understand your concern. In my case, master and worker are both Windows systems. I think this could be solved by adding a new input argument that lets the user specify the worker's operating system. Another possibility is to extend the "workers" option to "hostname, number of workers, system". Then workers with different operating systems could be used together. If the system is not specified, the master's operating system could be used as the default for the workers. Automatic detection would be the best solution :-) But maybe that is too much effort for this kind of issue ...
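A rough sketch of how the extended "workers" option could look. The third column, its 'pc' / 'unix' values, and the host names are all hypothetical and not part of the toolbox; an empty entry falls back to the master's operating system via "ispc". The commands are only displayed, not executed.

```matlab
% Sketch only: a hypothetical third column in "workers" names each worker's OS.
% Columns: host name, number of workers, OS ('pc' or 'unix'; empty = same as master).
workers = {'winbox', 4, 'pc'; 'linuxbox', 8, 'unix'; 'otherbox', 2, ''};

num_hosts = size(workers, 1);
worker_is_pc = false(num_hosts, 1);
for w = 1:num_hosts
    if size(workers, 2) < 3 || isempty(workers{w,3})
        % No OS given: assume the worker matches the master
        worker_is_pc(w) = ispc;
    else
        worker_is_pc(w) = strcmpi(workers{w,3}, 'pc');
    end
end

% Pick the per-worker launch command based on the worker's OS, not the master's
for w = 1:num_hosts
    if worker_is_pc(w)
        cmd = sprintf('ssh %s batch_job_distrib_cmd.bat', workers{w,1});
    else
        cmd = sprintf('ssh %s ./batch_job_distrib_cmd.bat', workers{w,1});
    end
    disp(cmd);
end
```

With a per-worker flag like this, the "ispc" checks in the proposed changes could test the worker's flag instead, which would cover a Windows master driving a Linux worker.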

spotlightgit commented 4 years ago

Hey Oliver, have you decided already how you want to continue with this issue? :-)

ojwoodford commented 4 years ago

No. It's a bit tricky. Ideally I'd like to get rid of the file and just send the command over ssh, but I'm not yet sure how to do this in a platform-agnostic way. I'm open to input.

spotlightgit commented 4 years ago

Well, as you mentioned, it should be possible to send the commands directly instead of starting a batch file. I work with MATLAB on Windows, so I have no experience with the differences between the two operating systems when running MATLAB.

spotlightgit commented 4 years ago

Hey Oliver, based on my fast-running test function, I thought distributed computing was working on Windows (with my adaptations above), but it is not. It seems that SSH behaves differently on Windows and Linux: apparently, all applications started within an SSH session are closed when the session closes. In your current implementation, executing the batch file is a single command that opens an SSH connection and closes it immediately afterwards, so the MATLAB instance it starts is also closed immediately.

It looks like I missed this behaviour in my initial testing with a fast goal function. Probably the master MATLAB did the number crunching instead of the workers, so I got no errors. Now, with my slow goal function, I noticed the issue. Have you ever used your toolbox on Windows? Do you have any suggestions for how this issue could be solved?

ojwoodford commented 4 years ago

I have used it on Windows, but several years ago. The process needs to be disconnected from the shell. I believe the way to do this is to use the start command: https://superuser.com/questions/1069972/windows-run-process-on-background-after-closing-cmd/1069983

However, I'm already doing this, and you say it doesn't work. It needs further investigation. Unfortunately, I don't have time to do this at present.