ojwoodford / batch_job

Parallelize MATLAB for loops across workers, without the Parallel Computing Toolbox
MIT License

Error using load #9

Closed spotlightgit closed 4 years ago

spotlightgit commented 4 years ago

Before I apply the parallelization to my lengthy goal functions, I want to check my optimization algorithm on some test functions that are easy to evaluate. When I do so, I always get an error file after each call of batch_job_distrib, with this content:

Error using load
Unable to read file 'D:/.../tp12e45761_47d1_4658_9350_ecd6c93a7c37.mat'. No such file or directory.

I assume the reason is that one worker is faster than the other and evaluates the last job, while another worker also wants to evaluate the same job but is too late. I further assume that this is a warning that can be ignored, because everything else works fine.

Is there any way to avoid the creation of this error file? As the evaluation time of the goal functions gets longer, this issue becomes less important, but from my understanding it could in theory happen every time. What would be your suggestion?

ojwoodford commented 4 years ago

There are several concurrency issues that I haven't been able to fix. But this could be a bug that is fixable. Do you have a test script which can reliably reproduce the error? Are your workers on the same machine as the master, or a different machine?

spotlightgit commented 4 years ago

Right now all my workers are on a single machine (I will move to multiple machines in the coming days or weeks).

Here is a test script, which reproduces the error:

x = rand(3,10);
out = batch_job_distrib(@Rastrigin, x, {'',2}, '-chunk_lims', [1 1]);

and the corresponding goal function:

function [y] = Rastrigin(x)
n = length(x); 
s = 0;
for j = 1:n
    s = s+(x(j)^2-10*cos(2*pi*x(j))); 
end
y = 10*n+s;
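
For a quick sanity check of the goal function outside of batch_job, the sketch below gives a vectorized equivalent and a serial baseline. The name rastrigin_vec is hypothetical and is not part of batch_job. The baseline assumes that the function is meant to be evaluated once per column of x, as the 3-by-10 input suggests; that is an assumption about how batch_job_distrib slices its input, not something stated in this thread.

```matlab
% Vectorized equivalent of the Rastrigin loop above.
% rastrigin_vec is a hypothetical name used only for this sketch.
rastrigin_vec = @(x) 10*numel(x) + sum(x(:).^2 - 10*cos(2*pi*x(:)));

% Known value: the function is exactly 0 at the origin.
assert(abs(rastrigin_vec(zeros(3,1))) < 1e-12);

% Serial baseline, one evaluation per column of x
% (assumed slicing, see the note above).
x = rand(3,10);
ref = zeros(1, size(x,2));
for k = 1:size(x,2)
    ref(k) = rastrigin_vec(x(:,k));
end
```

If the per-column assumption holds, ref can be compared against the output of the batch_job_distrib call to confirm that the parallel run returns the same values, possibly up to a difference in array shape.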

ojwoodford commented 4 years ago

Many thanks for the script. That helped a lot.

The issue was that the master worker finished the job before any workers could start. This shouldn't record an error.

I have made and pushed a fix.
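
The actual fix lives in the repository and is not reproduced here. As a general illustration of how a worker can tolerate a job file that has already disappeared, the following minimal sketch wraps load in a try/catch block. The helper name try_load_job and its return values are hypothetical, and the sketch assumes that fname is a full file name including the .mat extension.

```matlab
function [data, ok] = try_load_job(fname)
% TRY_LOAD_JOB  Load a job file, treating a missing file as "nothing to do".
% Hypothetical helper for illustration only; not part of batch_job.
data = [];
ok = false;
try
    data = load(fname);   % returns a struct of the variables in the file
    ok = true;
catch err
    % If the file is gone, assume the job was finished and cleaned up
    % elsewhere, and report this quietly instead of recording an error.
    if exist(fname, 'file') ~= 2
        return;
    end
    % The file exists but could not be read: this is a real error.
    rethrow(err);
end
end
```

A caller would check the flag before doing any work, for example `[job, ok] = try_load_job(fname); if ~ok, return; end`. Attempting the load first and checking for the file only after a failure avoids the gap between a separate existence check and the load, during which another process could still delete the file.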