rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Creating aliases and outputs on startup is slow #213

Closed: Atticus1806 closed this issue 5 days ago

Atticus1806 commented 1 month ago

I am currently running into the issue that my manager startup is slowed down by updating all aliases and outputs every time. I am wondering: is this even required for the manager to work properly? I guess skipping it can potentially cause files to be "missing" from output and alias, but the behaviour itself should be safe, right?

https://github.com/rwth-i6/sisyphus/blob/3181d724a999e6ec032656c2fb3e8ae14ed3eb97/sisyphus/manager.py#L551

For context, I am currently looking at 52 seconds until the config is loaded, 280 seconds for the aliases, and 1370 seconds for the outputs. This is probably also related to a slow filesystem and to creating quite a number of outputs and aliases, but these are hard for me to fix right now. If skipping the update does not endanger the manager's behaviour, I would create a PR adding a flag to disable the full update on startup.

My first test shows that this should work, but since I am not that familiar with the manager loop I want to make sure this does not implicitly break anything.
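A minimal sketch of what such a flag could look like. Every name here (`StartupSettings`, `skip_full_update_on_startup`, `run_startup`) is hypothetical and not part of the Sisyphus API; the two callables only stand in for the real alias and output updates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StartupSettings:
    # Hypothetical flag, not an existing Sisyphus setting.
    skip_full_update_on_startup: bool = False


def run_startup(
    settings: StartupSettings,
    update_aliases: Callable[[], None],
    update_outputs: Callable[[], None],
) -> List[str]:
    """Run the startup steps and return the names of the steps executed."""
    executed: List[str] = []
    if not settings.skip_full_update_on_startup:
        # Default behaviour: refresh everything before entering the loop.
        update_aliases()
        executed.append("aliases")
        update_outputs()
        executed.append("outputs")
    # The manager loop itself starts in both cases.
    executed.append("manager_loop")
    return executed


calls: List[str] = []
default_steps = run_startup(
    StartupSettings(),
    lambda: calls.append("alias update"),
    lambda: calls.append("output update"),
)
skipped_steps = run_startup(
    StartupSettings(skip_full_update_on_startup=True),
    lambda: calls.append("alias update"),
    lambda: calls.append("output update"),
)
print(default_steps, skipped_steps)
```

Keeping the default at `False` would preserve the current behaviour for everyone who does not opt in.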

michelwi commented 1 month ago

> is this even required for the manager to work properly? I guess this can potentially cause files to be "missing" from output and alias, but the behaviour itself should be safe, right?

I think I would agree; in principle the manager could already start without all outputs in place. Unless, of course, you are a naughty person and define tk.Path objects that point into your output folder.

> disable the full update on startup.

I am not sure when else the full update would happen.

I guess in cases where you kill the manager to clear Jobs that went into an error state, there is not much use in updating everything every time. But when you kill the manager, change your graph and the outputs, and then restart it, we would need to update on startup. Otherwise all aliases and outputs would still point to the old versions from before the change, and the outputs would only be updated once the manager finishes: https://github.com/rwth-i6/sisyphus/blob/3181d724a999e6ec032656c2fb3e8ae14ed3eb97/sisyphus/manager.py#L632 (assuming you are not impatient like me and hit Ctrl+C a couple of times to get the shell back quicker).
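To illustrate what "still point to the old versions" means, here is a rough sketch that assumes aliases and outputs are plain symlinks on a POSIX filesystem. The helpers `find_stale_links` and `refresh_links` and all paths are made up for this example and are not Sisyphus functions.

```python
import os
import tempfile
from typing import Dict, List


def find_stale_links(link_dir: str, expected: Dict[str, str]) -> List[str]:
    """Return names whose symlink is missing or points to another target."""
    stale = []
    for name, target in sorted(expected.items()):
        link = os.path.join(link_dir, name)
        if not os.path.islink(link) or os.readlink(link) != target:
            stale.append(name)
    return stale


def refresh_links(link_dir: str, expected: Dict[str, str]) -> List[str]:
    """Rewrite only the stale links and return the names that were touched."""
    stale = find_stale_links(link_dir, expected)
    for name in stale:
        link = os.path.join(link_dir, name)
        if os.path.lexists(link):
            os.remove(link)  # drop the outdated link first
        os.symlink(expected[name], link)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    # Link left over from before the graph change (hypothetical paths).
    os.symlink("work/job.old", os.path.join(tmp, "model"))
    # What the changed graph expects after the restart.
    expected = {"model": "work/job.new", "report": "work/report.1"}
    refreshed = refresh_links(tmp, expected)
    remaining = find_stale_links(tmp, expected)
```

Without an update on startup, `model` would keep pointing at `work/job.old` until the manager finishes.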

Maybe the update could be pushed into a thread that runs in parallel to the manager loop?

JackTemaki commented 1 month ago

> Maybe the update could be pushed into a thread that runs in parallel to the manager loop?

This sounds like a good idea.
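A minimal sketch of the thread idea using only the standard library. The function names are hypothetical, `slow_update` is just a stand-in for the alias/output update, and the `for` loop stands in for the manager loop; none of this is taken from the Sisyphus code.

```python
import threading
from typing import Callable, List


def start_background_update(update_fn: Callable[[], None]) -> threading.Thread:
    """Run update_fn in a thread so the caller can continue immediately."""
    # daemon=True so a hanging filesystem call cannot block interpreter exit.
    thread = threading.Thread(
        target=update_fn, name="alias-output-update", daemon=True
    )
    thread.start()
    return thread


def demo() -> List[str]:
    events: List[str] = []
    loop_progressed = threading.Event()

    def slow_update() -> None:
        # Stand-in for the slow update: it only finishes after the loop
        # has made progress, mimicking a slow filesystem.
        loop_progressed.wait(timeout=5)
        events.append("update finished")

    worker = start_background_update(slow_update)
    for step in range(3):
        # Stand-in for manager loop iterations running during the update.
        events.append(f"loop step {step}")
    loop_progressed.set()
    worker.join(timeout=5)  # wait for the update before shutting down
    return events


events = demo()
print(events)
```

In a real implementation the manager would presumably have to join this thread before its final output update on exit, and any graph state shared between the two threads might need locking.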