multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

muscle3 profile command fails because it cannot find an entry in the performance.sqlite database associated with the name of an instance #272

Open YehorYudinIPP opened 9 months ago

YehorYudinIPP commented 9 months ago

Calling muscle3 profile -r performance.sqlite fails with KeyError: 'stop', where stop is the name of a workflow instance. The version of the MUSCLE3 library is 0.7.0.

The full Python traceback is:

Traceback (most recent call last):
  File "/u/yyudin/conda-envs/python3114/bin/muscle3", line 8, in <module>
    sys.exit(muscle3())
             ^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/muscle3/muscle3.py", line 73, in profile
    plot_resources(Path(performance_file))
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/muscle3/profiling.py", line 52, in plot_resources
    stats = db.resource_stats()
            ^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/libmuscle/manager/profile_database.py", line 161, in resource_stats
    instances, run_times, commtimes, = self.instance_stats()
                                       ^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/libmuscle/manager/profile_database.py", line 131, in instance_stats
    total_times = [(stop_run[i] - start_run[i]) * 1e-9 for i in instances]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/yyudin/conda-envs/python3114/lib/python3.11/site-packages/libmuscle/manager/profile_database.py", line 131, in <listcomp>
    total_times = [(stop_run[i] - start_run[i]) * 1e-9 for i in instances]
KeyError: 'stop'

LourensVeen commented 9 months ago

It looks like this instance started its run but didn't shut down cleanly. When that happens, the shutdown isn't recorded in the database, and so MUSCLE3 doesn't know how long it ran and cannot calculate which percentage of the core's time it used.

There should be a test and a nice error message here at least, so thanks for reporting this!
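To illustrate the kind of check described here, the following is a minimal sketch, not MUSCLE3's actual code or the eventual fix. It assumes the start and stop timestamps are held in two plain dictionaries keyed by instance name, in nanoseconds, which is what the `stop_run[i] - start_run[i]` expression in the traceback suggests. The function name `total_run_times` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def total_run_times(
        start_run: Dict[str, int], stop_run: Dict[str, int]
        ) -> Tuple[Dict[str, float], List[str]]:
    """Return run times in seconds, plus instances with no stop record.

    Both arguments map instance names to timestamps in nanoseconds
    (an assumption based on the 1e-9 factor in the traceback).
    """
    # Instances that started but never recorded a shutdown.
    unfinished = sorted(name for name in start_run if name not in stop_run)
    # Only compute a run time where both timestamps exist.
    totals = {
        name: (stop_run[name] - start_run[name]) * 1e-9
        for name in start_run if name in stop_run}
    return totals, unfinished


# Made-up timestamps: 'stop' started but never recorded a shutdown.
start_run = {'other': 1_000_000_000, 'stop': 2_000_000_000}
stop_run = {'other': 4_000_000_000}

totals, unfinished = total_run_times(start_run, stop_run)
if unfinished:
    print('No shutdown recorded for: ' + ', '.join(unfinished))
```

Collecting the missing names first, rather than letting the dictionary lookup raise, makes it possible to report every affected instance in one message instead of a bare KeyError for the first one.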

Could you check the log and see if it says anything about the stop instance crashing or shutting down because something else crashed?

Also, which version of MUSCLE3 are you using?

YehorYudinIPP commented 9 months ago

Thanks Lourens! I updated the comment; it's MUSCLE3 0.7.0. Indeed, the workflow failed due to an error in the turbulence_sim component:

muscle_manager 2023-10-06 00:15:14,185 ERROR libmuscle.manager.instance_manager: Instance turbulence_sim quit with error 38
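For longer logs, lines like this one can be picked out with a short script. This is only a sketch: the regular expression is modeled on the single log line quoted here, so other MUSCLE3 versions or other failure modes may be worded differently, and the helper name `find_failed_instances` is hypothetical.

```python
import re
from typing import List, Tuple

# Pattern modeled on the one manager log line quoted in this thread;
# it is an assumption that other failures are logged in the same form.
QUIT_RE = re.compile(r'Instance (\S+) quit with error (\d+)')


def find_failed_instances(log_text: str) -> List[Tuple[str, int]]:
    """Return (instance name, exit code) pairs found in the log text."""
    return [
        (m.group(1), int(m.group(2)))
        for m in QUIT_RE.finditer(log_text)]


sample = (
    'muscle_manager 2023-10-06 00:15:14,185 ERROR '
    'libmuscle.manager.instance_manager: '
    'Instance turbulence_sim quit with error 38\n')

for name, code in find_failed_instances(sample):
    print(f'{name} exited with code {code}')
```

To use it on a real run, read the manager log file into a string and pass that to the function instead of `sample`.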

Which in turn failed because I exceeded the disk quota on my cluster, unfortunately:

forrtl: Disk quota exceeded
forrtl: severe (38): error during write, unit 42, file /cobra/u/yyudin/code/MFW/muscle3/workflow/run_fusion_gem_multiimpl_20231005_criteria/instances/turbulence_sim/workdir/p02.dat

LourensVeen commented 9 months ago

Okay, yes, then the issue here is that there should be a better error message. I'll go fix that. Note that 0.7.1 is out with several fixes to the profiling system (including that muscle3 profile -t is now working), so you may want to upgrade :smile: