sg-s / xolotl

A MATLAB neuron simulator. Very fast (written in C++). Flexible (fully object oriented). Immediate (live manipulation in MATLAB). Comes with a powerful parameter optimizer. Get started ➡️
https://go.brandeis.edu/xolotl
GNU General Public License v3.0

running in parallel is broken for mysterious reasons #525

Closed sg-s closed 4 years ago

sg-s commented 4 years ago

Running demo_parallel randomly fails, with these errors:

Error using xolotl/compile (line 55)
The command '/usr/bin/xcrun' exited with a return value '0'

Error in xolotl/integrate (line 109)
    self.compile;

Error in demo_parallel (line 38)
parfor i = 1:length(all_params)

or this one:

Error using xolotl/compile (line 55)
'/Users/srinivas/Documents/MATLAB/Add-Ons/Toolboxes/xolotl/code/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64'
is not a MEX file. For more information, see File is not a MEX file.

Error in xolotl/integrate (line 109)
    self.compile;

Error in demo_parallel (line 38)
parfor i = 1:length(all_params)

or this one:

Error using xolotl/integrate (line 186)
Invalid MEX-file
'/Users/srinivas/Documents/MATLAB/Add-Ons/Toolboxes/xolotl/code/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64':
dlopen(/Users/srinivas/Documents/MATLAB/Add-Ons/Toolboxes/xolotl/code/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64,
6): no suitable image found.  Did find:
    /Users/srinivas/Documents/MATLAB/Add-Ons/Toolboxes/xolotl/code/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64:
        code signature in
        (/Users/srinivas/Documents/MATLAB/Add-Ons/Toolboxes/xolotl/code/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64)
        not valid for use in process using Library Validation: library load
        disallowed by system policy.

Error in demo_parallel (line 41)
parfor i = 1:length(all_params)

sg-s commented 4 years ago

But weirdly, running demo_xfit never fails, even though it runs code in parallel.

sg-s commented 4 years ago

Turning on verbosity shows that for whatever weird reason, xolotl is recompiling on every parallel worker. This is not good!

sg-s commented 4 years ago

And it still RANDOMLY fails...

sg-s commented 4 years ago

Source of error: the hash is empty, for whatever reason, so the mex file's path is invalid, and MATLAB complains about an invalid mex file.

So problems:

  1. Why does it recompile every time in parallel?
  2. Why is the hash empty?

sg-s commented 4 years ago

I don't understand it, but adding an x.integrate call outside the parfor loop seems to fix everything...

alec-hoyland commented 4 years ago

My guess is that by calling x.integrate outside the parfor loop, you're compiling/hashing once up front, and then the parallel workers can use that compiled code. parfor in MATLAB is weird when it has to read files. For example, inside a parfor loop you can't load a .mat file unless you assign the result to a variable (s = load(...)).
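The workaround described here can be sketched roughly as follows. This is only an illustration: it assumes `x` is an already-configured xolotl object and `all_params` is the parameter array used in demo_parallel, whose layout is not shown in this thread, so the per-iteration parameter update is left as a comment.

```matlab
% Sketch of the workaround. Assumes x (a configured xolotl object) and
% all_params (from demo_parallel) already exist in the workspace.

x.integrate;    % compile and hash once, before any worker touches the model

results = cell(1, length(all_params));

parfor i = 1:length(all_params)
    % Apply the i-th parameter set to x here. How to do that depends on
    % the model and on the layout of all_params, so it is omitted.
    results{i} = x.integrate;   % intended to reuse the existing MEX file
end
```

The only change relative to the failing loop is the single `x.integrate` call before `parfor`.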

On Tue, Jun 16, 2020 at 5:28 am, Srinivas Gorur-Shandilya notifications@github.com wrote:

I don't understand it, but adding a x.integrate outside the parfor seems to fix everything...

sg-s commented 4 years ago

But the model is already compiled. For mysterious reasons, in the parfor loop, it can't "see" the mex file -- exist('mex file') returns 0

sg-s commented 4 years ago

Further mystery:

exist('~/code/xolotl/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64')

returns 2 (an ordinary file), not 3 (a MEX file), as one would expect
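For reference, `exist` returns 2 when the name is an ordinary file and 3 when it is a MEX file on the MATLAB search path. A small check along these lines, using a throwaway text file rather than the real binary, shows what the "2" case looks like. The file name and variable names are made up for illustration.

```matlab
% Create a temporary plain file and ask MATLAB what it thinks it is.
tmp = [tempname '.txt'];
fid = fopen(tmp, 'w');
fprintf(fid, 'placeholder');
fclose(fid);

code = exist(tmp, 'file');   % a plain file at a full path gives 2
is_mex = (code == 3);        % 3 would mean MATLAB sees a MEX file

fprintf('exist returned %d, treated as MEX: %d\n', code, is_mex);

delete(tmp);                 % clean up the temporary file
```

Running the same `exist(..., 'file')` check on the compiled binary from inside the parfor loop is one way to see whether a worker recognizes it as a MEX file.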

sg-s commented 4 years ago

More clues -- somewhere in the parallel call chain, xolotl.loadobj is being called. Does MATLAB really save objects to disk and load them in parallel workers? Is this how data is passed around?

sg-s commented 4 years ago

[SOLVED]

Data is saved to disk and loaded to pass it between parallel workers, and during this shuffle the hash is set to an empty string. Adding a guard for an empty hash in loadobj fixes the problem.
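The actual guard is not shown in this thread, so the following is only a minimal sketch of the idea, using a hypothetical stand-in class (`HashedModel`, `compute_hash`, and the toy hash are all invented here and are not part of xolotl). It assumes the hash is lost because it is not serialized; the real cause in xolotl is not stated above.

```matlab
% HashedModel.m -- hypothetical stand-in class, not part of xolotl
classdef HashedModel
    properties
        params = [1 2 3]   % saved together with the object
    end
    properties (Transient)
        hash = ''          % Transient: dropped when the object is saved
    end
    methods
        function self = HashedModel()
            self.hash = HashedModel.compute_hash(self.params);
        end
    end
    methods (Static)
        function h = compute_hash(params)
            % Toy hash for illustration only.
            h = sprintf('%d_', params);
        end
        function obj = loadobj(obj)
            % loadobj runs whenever the object is deserialized, which
            % includes being sent to a parallel worker. Guard against
            % an empty hash by recomputing it.
            if isempty(obj.hash)
                obj.hash = HashedModel.compute_hash(obj.params);
            end
        end
    end
end
```

With the guard in place, saving and reloading the object (`save`, then `load`) gives back a non-empty `hash`. Without it, the reloaded object would carry the empty default.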

sg-s commented 4 years ago

This is still an issue, because N different workers are compiling and writing to the same file. Depending on when each worker finishes compiling, the MEX file can get corrupted.

The best solution is simply to disallow compiling by parallel workers. We will throw an informative error.
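One way such a check could look is sketched below. The function name, error identifier, and message are hypothetical and are not taken from xolotl. The sketch relies on `getCurrentTask` from the Parallel Computing Toolbox, which returns an empty array on the client and a task object on a worker, so it assumes that toolbox is installed.

```matlab
function assert_not_on_worker()
% Hypothetical guard: refuse to compile when running on a parallel worker.
% getCurrentTask is empty on the client session and non-empty on a worker.
if ~isempty(getCurrentTask())
    error('demo:compileOnWorker', ...
        ['Compiling on a parallel worker is not allowed, because ', ...
         'several workers may write the same MEX file at once. ', ...
         'Call x.integrate once before the parfor loop.']);
end
end
```

Calling a guard like this at the top of the compile step would turn the silent race into an immediate, explainable failure, while leaving compilation on the client untouched.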