sg-s closed this issue 4 years ago
But weirdly, running demo_xfit never fails, even though it runs code in parallel.
Turning on verbosity shows that for whatever weird reason, xolotl is recompiling on every parallel worker. This is not good!
And it still RANDOMLY fails...
Source of error: the hash is empty, for whatever reason, so the mex file's path is invalid, and MATLAB complains about an invalid mex file.
So problems:
I don't understand it, but adding a x.integrate outside the parfor seems to fix everything...
My guess is that by calling x.integrate outside the parfor loop, you're compiling and hashing up front, and the parallel workers can then use that compiled code. parfor in MATLAB is weird when it has to read files. For example, you can't call load on a .mat file inside a parfor loop unless you assign the result to a variable (S = load(...)), so that the data lands in a struct.
On Tue, Jun 16, 2020 at 5:28 am, Srinivas Gorur-Shandilya wrote:
> I don't understand it, but adding a x.integrate outside the parfor seems to fix everything...
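For anyone landing here, the workaround looks roughly like the sketch below. It assumes xolotl is on the path; make_model is a hypothetical helper standing in for however you build your model, and the number of runs is arbitrary.

```matlab
% Sketch of the workaround described above, not a definitive fix.
% make_model is hypothetical: it should return a configured xolotl object.
x = make_model();

% Integrate once in the client session, before the parfor loop,
% so that hashing and compilation happen outside the workers.
x.integrate;

N = 8;                      % number of parallel runs (arbitrary)
V_final = zeros(N, 1);
parfor i = 1:N
    V = x.integrate;        % workers are expected to reuse the existing binary
    V_final(i) = V(end, 1); % keep the last voltage sample of the first compartment
end
```

Why this helps was not obvious at this point in the thread; the comments that follow dig into it.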
But the model is already compiled. For mysterious reasons, in the parfor loop, it can't "see" the mex file -- exist('mex file') returns 0
Further mystery:
exist('~/code/xolotl/X_7786eb356c0b6610992377cd56f82dd3.mexmaci64')
returns 2 (a file), not the 3 (a MEX file) one would expect
More clues -- somewhere in the parallel call chain, xolotl.loadobj is being called. Does MATLAB really save objects to disk and load them in parallel workers? Is this how data is passed around?
[SOLVED]
Data is saved to disk and loaded again when it is passed between parallel workers, and during this shuffle the hash is set to an empty string. A guard for this in loadobj fixes it.
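A minimal sketch of what such a guard can look like. HashedModel, its properties, and rehash are hypothetical stand-ins and not xolotl's actual code; the point is only that loadobj recomputes the hash when it arrives empty.

```matlab
% HashedModel.m -- hypothetical stand-in class, not xolotl itself
classdef HashedModel
    properties
        params = [1 2 3]   % stand-in for the model description
        hash = ''          % identifies the compiled binary for this model
    end
    methods
        function self = rehash(self)
            % Placeholder "hash": a real implementation would hash the
            % full model structure. Here we just stringify the parameters.
            self.hash = mat2str(self.params);
        end
    end
    methods (Static)
        function obj = loadobj(obj)
            % MATLAB calls loadobj whenever the object is loaded,
            % which includes being sent to a parallel worker.
            % Guard: if the hash got lost on the way, recompute it.
            % (This sketch assumes obj arrives as an object, not a struct.)
            if isempty(obj.hash)
                obj = obj.rehash();
            end
        end
    end
end
```

With this in place, an object whose hash was blanked in transit comes out of loadobj with a valid hash again, so the path to the binary can be rebuilt.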
This is still an issue -- because N different workers are compiling and writing to the same file. Based on when workers finish the compilation, the MEX file can get corrupted.
The best solution is simply to disallow compiling by parallel workers. We will throw an informative error.
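One way to sketch that check is below. It assumes the Parallel Computing Toolbox is installed, since getCurrentTask comes from there and returns empty in the client session. The function name, error identifier, and message are made up for illustration and are not what xolotl actually ships.

```matlab
function check_not_on_worker()
% Throw an informative error if called from a parallel worker.
% Hypothetical helper: call it at the top of the compile routine.

% getCurrentTask returns [] in the client session and a task object
% on a worker, so a non-empty result means we are inside parfor/spmd.
on_worker = ~isempty(getCurrentTask());

if on_worker
    error('xolotl:compile:parallel', ...
        ['Compiling from a parallel worker is not allowed, because ', ...
         'several workers could write the same MEX file at once. ', ...
         'Integrate the model once outside the parfor loop first.']);
end
end
```

Failing loudly here trades a random, hard-to-debug corrupted binary for a deterministic error that tells the user what to do.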
Running demo_parallel randomly fails, with one of several different errors.