vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License
274 stars 45 forks source link

Master process does not quit after task completion #583

Closed gaow closed 7 years ago

gaow commented 7 years ago

This is not reproducible but I wonder if others have seen the same. I have a long running job that got partially completed then it hangs like this:

INFO: b3d28991abc1c8e58e678d307263ba4c completed
cat output/20170503CellCycleTrueFactors/CellCycleSin_10086_noise_20_20_10086.updates.log | sed -n "/Loading matrix:/,/Loglik:/p" | sed 's/Loglik://g' | grep -v "Loading matrix:" > output/20170503CellCycleTrueFactors/CellCycleSin_10086_noise_20_20_10086.loadings.txt
cat output/20170503CellCycleTrueFactors/CellCycleSin_10086_noise_20_20_10086.updates.log | grep -A1 Loglik | sed -n '2~3p' > output/20170503CellCycleTrueFactors/CellCycleSin_10086_noise_20_20_10086.loglik.txt
echo '' >> output/20170503CellCycleTrueFactors/CellCycleSin_10086_noise_20_20_10086.loglik.txt

INFO: 2382823202a802302fa7231c81724556 completed
INFO: output:   [output/20170503CellCycleTrueFa...actors.txt, ...] (8 items)
Process SoS_Worker-3:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/miniconda3/lib/python3.6/site-packages/sos-0.9.4-py3.6.egg/sos/sos_executor.py", line 140, in run
    env.logger.debug('Worker {} receives request {}'.format(self.worker_id(), work[0]))
KeyError: 0
Running default_0 (19:37:00): 

Running default_0 (19:38:00): 

Line 140 of sos_executor.py is problematic. Not sure if it can be fixed without a reproducible example ...

BoPeng commented 7 years ago

The worker is supposed to receive a request (an array) but it received a dictionary. Only -v3 even -v4 can show where in the master process this message was sent.

You could remove the [0] after worker in that message to display the complete message although the worker would still fail to process it.

gaow commented 7 years ago

Yes this is not exactly reproducible ... but i'll use -v4 for all of my SoS runs from now on until we can run into this issue again and fix ...

The worker is supposed to receive a request (an array) but it received a dictionary

Why is that?

BoPeng commented 7 years ago

I do not know. There are messages of different types for different purposes. Perhaps I should standardize them (a message protocol) but right now the worker should receive a None to commit suicide if there is no more task.

gaow commented 7 years ago

Not sure if it helps but I should emphasize that my tasks sequences have not completed. There is a couple of tasks remaining. There is nothing wrong with my code but it just hangs there. Then I killed it, rerun it and it completed without an issue. For now i'll keep running with -v4 and report back if I get lucky :)

BoPeng commented 7 years ago

Just pushed a patch to show all message that the worker receives.

gaow commented 7 years ago

Ok here it goes, again. But ... I did not update my SoS so there is still not enough info ...

TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/30c782c83225cdc2228a34ca226afc2b.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: 30c782c83225cdc2228a34ca226afc2b completed
TRACE: STATUS c2c88acc99719b73b1d68ad7ce192396  completed

TRACE: STATUS 30c782c83225cdc2228a34ca226afc2b  completed

TRACE: STATUS c1c7e7dd2b2b5c92fa28fc8d0fe66d44  completed

DEBUG: completed: 310 failed: 19 result-mismatch: 53 aborted: 6
DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/7fcaae424a8c025972f7943df4d67e18.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/LinearBM_10086_2_..., ...] (4 items)
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/c5f501d0947d3f66dd3f212549e9758b.exe_info
TRACE: Lock released for output files [output/20170504BMNoise/VShapedBM_10086_3..., ...] (4 items)
TRACE: Write signature /home/gaow/Documents/GIT/github/dr-tree/analysis/notebook/.sos/.runtime/ae69ca111b7dbc049d3a4342cca4770a.exe_info
DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: output:   [output/20170504BMNoise/LinearB...actors.txt, ...] (12 items)
DEBUG: Worker 1 completes request step
Process SoS_Worker-3:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/miniconda3/lib/python3.6/site-packages/sos-0.9.4-py3.6.egg/sos/sos_executor.py", line 140, in run
    env.logger.debug('Worker {} receives request {}'.format(self.worker_id(), work[0]))
KeyError: 0
DEBUG: Master receive a result
DEBUG: Master send res to nested
DEBUG: Nested receive a result
DEBUG: Nested receive step result 
DEBUG: Nested send step 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 to master with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
DEBUG: Master receives step request 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
DEBUG: Master execute 4f5910cc-8b10-4e14-b7b6-a332f5f5a7d6 from step queue with args {'seed': 10086, 'K': [2, 3, 4], 'color': "c('#377EB8', '#E41A1C', '#4DAF4A', '#984EA3', '#FFD92F', '#FF7F00', '#F781BF', '#8DD3C7', '#B3B3B3', '#000000', '#56B4E9', '#BC80BD', '#FDB462', '#350E20', '#8A9045', '#800000')", 'label': "c('I10', 'I12', 'I11', 'I1', 'I2', 'I13', 'I3', 'I4', 'I14', 'I5', 'I6', 'I15', 'I7', 'I8')", 'grid': 'NULL', 'outdir': 'output/20170504BMNoise', 'tag': '', 'known_start': 'F', 'multiple_start': 'F', 'residual_scale': 500000, '__args__': []}
Running default_0 (00:35:00): 
Running default_0 (00:36:00): 

But this example seems to reproduce it. Will try another round later.

BoPeng commented 7 years ago

Too bad there is no details of the message. Please update and try again.

gaow commented 7 years ago

Here we go, after update -- though I do not think it is informative, I at least now how it is reproduced.

DEBUG: Put results for c2c88acc99719b73b1d68ad7ce192396 c1c7e7dd2b2b5c92fa28fc8d0fe66d44 30c782c83225cdc2228a34ca226afc2b
TRACE: Lock released for output files [output/20170504BMNoise/YShapedBM_10086_4..., ...] (4 items)
INFO: output:   [output/20170504BMNoise/LinearB...actors.txt, ...] (12 items)
DEBUG: Worker 1 completes request step
DEBUG: Worker 1 receives request {'c2c88acc99719b73b1d68ad7ce192396': {'ret_code': 0, 'output': {'output/20170504BMNoise/LinearBM_10086_2_10086.factors.txt': '85d4b76f3134b30a33a4e6c5ce34199a', 'output/20170504BMNoise/LinearBM_10086_2_10086.factor_weights.txt': '9e7d6a7b3efaa0f491c6ea11829ff2b0', 'output/20170504BMNoise/LinearBM_10086_2_10086.loadings.txt': '816a8e539e4ff0687319f9f02405fae7', 'output/20170504BMNoise/LinearBM_10086_2_10086.loglik.txt': 'f2194630592b4c46102731204d8dc6f6'}}, 'c1c7e7dd2b2b5c92fa28fc8d0fe66d44': {'ret_code': 0, 'output': {'output/20170504BMNoise/VShapedBM_10086_3_10086.factors.txt': '519c70a415a87f21be0ec154d7dd24e4', 'output/20170504BMNoise/VShapedBM_10086_3_10086.factor_weights.txt': 'ae1141a1c20aa2357744ac6b71e6d85a', 'output/20170504BMNoise/VShapedBM_10086_3_10086.loadings.txt': '91ef9e19ec976ba25aeb870c25d7caa8', 'output/20170504BMNoise/VShapedBM_10086_3_10086.loglik.txt': 'cf0c30cf42d4b89769fc8e99cc28917f'}}, '30c782c83225cdc2228a34ca226afc2b': {'ret_code': 0, 'output': {'output/20170504BMNoise/YShapedBM_10086_4_10086.factors.txt': 'fb720a0a522bfd05fb2f51d21e754795', 'output/20170504BMNoise/YShapedBM_10086_4_10086.factor_weights.txt': '5e9a4184eddc73e0dbbf0f82067d578b', 'output/20170504BMNoise/YShapedBM_10086_4_10086.loadings.txt': 'aa1505e092bc0cf41422b0e2ad4da5ca', 'output/20170504BMNoise/YShapedBM_10086_4_10086.loglik.txt': 'a3c5838cee9e190e56987c37e2e293ab'}}}
Process SoS_Worker-3:
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/opt/miniconda3/lib/python3.6/site-packages/sos-0.9.4-py3.6.egg/sos/sos_executor.py", line 142, in run
    if work[0] == 'step':
KeyError: 0
DEBUG: Master receive a result
DEBUG: Master send res to nested
DEBUG: Nested receive a result
DEBUG: Nested receive step result 
DEBUG: Nested send step f33166f5-c038-46d7-a5a7-7b71ae57e654 to master with args {'seed': 10086, 'K' ...

Running default_0 (00:25:00): 
BoPeng commented 7 years ago

Worker one received a result instead of another request. Interesting.

BoPeng commented 7 years ago

I understand what happened but not why. Could you send me your script, at least the skeleton (which step sends tasks etc) so that I can try to reproduce it?

BoPeng commented 7 years ago

Or at least send me complete trace.

gaow commented 7 years ago

Sorry I just got home from a defense reception .... sure I can send you my script! Will do it first thing tomorrow. Thanks a lot!

gaow commented 7 years ago

@BoPeng sorry got side-tracked by emails. Here is the complete set of script (there are irrelevant files just ignore them); the command is sos run main.sos -J8 -j8.

test_hang.zip

I hope it can be reproduced. At the end of the day you should see some plots but typically it would hang before generating any plots, then you kill it, run it again, it will generate the plots.

BoPeng commented 7 years ago

Cannot compile due to lack of openmp on clang. brew install clang-omp did not help.

gaow commented 7 years ago

Would this tip help?

https://github.com/jobovy/extreme-deconvolution#installation-faq

BoPeng commented 7 years ago

I am asked for gsl, which I am too lazy to download....

BoPeng commented 7 years ago

It would be a good idea to create a docker image for the whole thing...

BoPeng commented 7 years ago

It looks like a very simple = to == typo, please check if the bug is gone.

gaow commented 7 years ago

Indeed! I know you can blind debug this one, I know it!!

Closing it for now until otherwise.