kbase / JobRunner


Strange behavior when EE2 throws an exception from finish_job #43

Open MrCreosote opened 3 years ago

MrCreosote commented 3 years ago

I recently introduced a bug that caused EE2 to throw an exception from the finish_job method, because catalog_utils.catalog is not an existing field: https://github.com/kbase/execution_engine2/blob/aecddbf3357eba07c2da92a1538387a2779e6ea1/lib/execution_engine2/sdk/EE2Status.py#L549

However, when running an app in the narrative, the job runner seemed to ignore the exception; at least, it didn't print a stack trace. Instead, it logged the following:

Job is done
1615180489.855946:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 34f56f72-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is b450fe482fc4f5315485edb4aa127b6deff68c644002064bc30230132ec880f3
1615180494.837040:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 37eeefaa-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is 93eb0680402df0c88d3e2066f64efb77a08df1fcde02701ed59ee21767fae635
1615180499.823481:ERROR:jr:WARNING: Module DataFileUtil was already used once for this job. Using cached version: url: https://github.com/kbaseapps/DataFileUtil commit: b640d1ce826f46d1bff497a2f475a4c7de016a18 version: 0.1.4 release: release
INFO:root:Running docker container for image: dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:About to run 3ae7dc9e-7fcd-11eb-849a-02ac072d09e7 dockerhub-ci.kbase.us/kbase:datafileutil.b640d1ce826f46d1bff497a2f475a4c7de016a18
INFO:root:Container id is 36ea0f819dd8fcb5a2534e5cafd6775db56841afab80d199cd0ef3413100862e
INFO:sanic.root:Stopping worker [11342]
INFO:sanic.root:Server Stopped
INFO:jr:Logger initialized for 6045b269172cf13980dabe9a
The monitor jobrunner script killed your job. It is possible the node got overloaded with too many jobs! Memory={'total': 33729970176, 'available': 31053275136, 'percent': 7.9, 'used': 2393518080, 'free': 8088121344, 'active': 2393497600, 'inactive': 19168612352, 'buffers': 136212480, 'cached': 23112118272, 'shared': 1810432, 'slab': 3632816128} CPU=26.1 

After fixing the bug, the logs ended after the Job is done line.

I would expect the job runner to print the exception and some sort of severe warning message that it couldn't complete the job properly.

MrCreosote commented 3 years ago

I'm guessing this line needs a try/except and logging: https://github.com/kbase/JobRunner/blob/master/JobRunner/JobRunner.py#L293
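A minimal sketch of the kind of guard that comment suggests. The wrapper function, its name, and the stand-in EE2 client are hypothetical (this is not the actual JobRunner code); the point is just that the finish_job call should be wrapped so a server-side exception gets logged loudly instead of disappearing:

```python
import logging

logger = logging.getLogger("jr")


def finish_job_safely(ee2_client, params):
    """Hypothetical wrapper around the EE2 finish_job call.

    Logs a severe error instead of silently swallowing any exception
    the EE2 server raises, and reports whether the call succeeded.
    """
    try:
        ee2_client.finish_job(params)
    except Exception as err:
        # Surface the failure so it appears in the job logs rather
        # than the job silently ending after "Job is done".
        logger.error(
            "SEVERE: EE2 finish_job raised an exception; the job may "
            "not have been completed properly: %s", err)
        return False
    return True
```

The caller (or the job runner's shutdown path) could then check the return value and emit a final warning to the user-visible log when completion fails.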