Closed gaow closed 7 years ago
The worker is supposed to receive a request (an array) but it received a dictionary. Only -v3
even -v4
can show where in the master process this message was sent.
You could remove the [0]
after worker
in that message to display the complete message although the worker would still fail to process it.
Yes this is not exactly reproducible ... but i'll use -v4 for all of my SoS runs from now on until we can run into this issue again and fix ...
The worker is supposed to receive a request (an array) but it received a dictionary
Why is that?
I do not know. There are messages of different types for different purposes. Perhaps I should standardize them (a message protocol) but right now the worker should receive a None
to commit suicide if there is no more task.
Not sure if it helps but I should emphasize that my tasks sequences have not completed. There is a couple of tasks remaining. There is nothing wrong with my code but it just hangs there. Then I killed it, rerun it and it completed without an issue. For now i'll keep running with -v4
and report back if I get lucky :)
Just pushed a patch to show all message that the worker receives.
Ok here it goes, again. But ... I did not update my SoS so there is still not enough info ...
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/30c782c83225cdc2228a34ca226afc2b.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: 30c782c83225cdc2228a34ca226afc2b completed
TRACE: STATUS c2c88acc99719b73b1d68ad7ce192396 completed
TRACE: STATUS 30c782c83225cdc2228a34ca226afc2b completed
TRACE: STATUS c1c7e7dd2b2b5c92fa28fc8d0fe66d44 completed
DEBUG: completed: 310 failed: 19 result-mismatch: 53 aborted: 6
DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/7fcaae424a8c025972f7943df4d67e18.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/LinearBM_10086_2_..., ...] (4 items)
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/c5f501d0947d3f66dd3f212549e9758b.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/VShapedBM_10086_3..., ...] (4 items)
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/ae69ca111b7dbc049d3a4342cca4770a.exe_info
DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: output: [output/20170504BMNoise/LinearB...actors.txt, ...] (12 items)
DEBUG: Worker 1 completes request step
Process SoS_Worker-3:
Traceback (most recent call last):
File "/opt/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/miniconda3/lib/python3.6/site-packages/sos-0.9.4-py3.6.egg/sos/sos_executor.py", line 140, in run
env.logger.debug('Worker {} receives request {}'.format(self.worker_id(), work[0]))
KeyError: 0
DEBUG: Master receive a result
DEBUG: Master send res to nested
DEBUG: Nested receive a result
DEBUG: Nested receive step result
DEBUG: Nested send step 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 to master with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
DEBUG: Master receives step request 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
DEBUG: Master execute 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 from step queue with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
Running default_0 (00:35:00):
Running default_0 (00:36:00):
But this example seems to reproduce it. Will try another round later.
Too bad there is no details of the message. Please update and try again.
Here we go, after update -- though I do not think it is informative, I at least now how it is reproduced.
DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: output: [output/20170504BMNoise/LinearB...actors.txt, ...] (12 items)
DEBUG: Worker 1 completes request step
DEBUG: Worker 1 receives request {'c2c88acc99719b73b1d68ad7ce192396': {'ret_code': 0, 'output': {'output/20170504BMNoise/LinearBM_10086_2_10086.factors.txt': '85d4b76f3134b30a33a4e6c5ce34199a', 'output/20170504BMNoise/LinearBM_10086_2_10086.factor_weights.txt': '9e7d6a7b3efaa0f491c6ea11829ff2b0', 'output/20170504BMNoise/LinearBM_10086_2_10086.loadings.txt': '816a8e539e4ff0687319f9f02405fae7', 'output/20170504BMNoise/LinearBM_10086_2_10086.loglik.txt': 'f2194630592b4c46102731204d8dc6f6'}}, 'c1c7e7dd2b2b5c92fa28fc8d0fe66d44': {'ret_code': 0, 'output': {'output/20170504BMNoise/VShapedBM_10086_3_10086.factors.txt': '519c70a415a87f21be0ec154d7dd24e4', 'output/20170504BMNoise/VShapedBM_10086_3_10086.factor_weights.txt': 'ae1141a1c20aa2357744ac6b71e6d85a', 'output/20170504BMNoise/VShapedBM_10086_3_10086.loadings.txt': '91ef9e19ec976ba25aeb870c25d7caa8', 'output/20170504BMNoise/VShapedBM_10086_3_10086.loglik.txt': 'cf0c30cf42d4b89769fc8e99cc28917f'}}, '30c782c83225cdc2228a34ca226afc2b': {'ret_code': 0, 'output': {'output/20170504BMNoise/YShapedBM_10086_4_10086.factors.txt': 'fb720a0a522bfd05fb2f51d21e754795', 'output/20170504BMNoise/YShapedBM_10086_4_10086.factor_weights.txt': '5e9a4184eddc73e0dbbf0f82067d578b', 'output/20170504BMNoise/YShapedBM_10086_4_10086.loadings.txt': 'aa1505e092bc0cf41422b0e2ad4da5ca', 'output/20170504BMNoise/YShapedBM_10086_4_10086.loglik.txt': 'a3c5838cee9e190e56987c37e2e293ab'}}}
Process SoS_Worker-3:
Traceback (most recent call last):
File "/opt/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/opt/miniconda3/lib/python3.6/site-packages/sos-0.9.4-py3.6.egg/sos/sos_executor.py", line 142, in run
if work[0] == 'step':
KeyError: 0
DEBUG: Master receive a result
DEBUG: Master send res to nested
DEBUG: Nested receive a result
DEBUG: Nested receive step result
DEBUG: Nested send step f33166f5-c038-46d7-a5a7-7b71ae57e654 to master with args {'seed': 10086, 'K' ...
Running default_0 (00:25:00):
Worker one received a result instead of another request. Interesting.
I understand what happened but not why. Could you send me your script, at least the skeleton (which step sends tasks etc) so that I can try to reproduce it?
Or at least send me complete trace.
Sorry I just got home from a defense reception .... sure I can send you my script! Will do it first thing tomorrow. Thanks a lot!
@BoPeng sorry got side-tracked by emails. Here is the complete set of script (there are irrelevant files just ignore them); the command is sos run main.sos -J8 -j8
.
I hope it can be reproduced. At the end of the day you should see some plots but typically it would hang before generating any plots, then you kill it, run it again, it will generate the plots.
Cannot compile due to lack of openmp on clang. brew install clang-omp
did not help.
Would this tip help?
https://github.com/jobovy/extreme-deconvolution#installation-faq
I am asked for gsl, which I am too lazy to download....
It would be a good idea to create a docker image for the whole thing...
It looks like a very simple =
to ==
typo, please check if the bug is gone.
Indeed! I know you can blind debug this one, I know it!!
Closing it for now until otherwise.
This is not reproducible but I wonder if others have seen the same. I have a long running job that got partially completed then it hangs like this:
Line 140 of
sos_executor.py
is problematic. Not sure if it can be fixed without a reproducible example ...