rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Wait time after (or between) mini jobs #146

Open albertz opened 1 year ago

albertz commented 1 year ago

My recognition + scoring + summarizing results pipeline contains a number of very simple mini jobs which run one after another. The wait time after a mini job is annoying. I don't want to have any wait time between two consecutive mini jobs.

I'm not sure how others think about this (excluding those who just don't care). What are the reasons for having this wait time? I'm specifically talking about mini jobs, i.e. jobs run via the LocalEngine.

I haven't looked too deeply into the code yet. Some things I am considering changing:

- Skip the WAIT_PERIOD_JOB_FS_SYNC wait between two consecutive jobs that both run via the LocalEngine.
- Let finishing local jobs interrupt the WAIT_PERIOD_BETWEEN_CHECKS sleep so the manager continues immediately.

Maybe such changes are not wanted by everyone, so this should be optional and configurable somehow?

michelwi commented 1 year ago

But the LocalEngine uses the same filesystem as the others... why should this be different here? If it works, you could just change the wait periods to a lower value, no?

albertz commented 1 year ago

I don't really know why we have WAIT_PERIOD_JOB_FS_SYNC and co. Apparently there was some issue before? My assumption was that the sync across nodes could be delayed, but I'm certain that there is no delay within a single node. So consecutive mini jobs running on the local node will in all cases directly see the changes in the FS. That is why I suggested having this exception for consecutive jobs on the LocalEngine.

And WAIT_PERIOD_BETWEEN_CHECKS is a different matter anyway. I don't just want to reduce it; I think that would be a bad idea for the cluster, no? I just want local jobs to be able to interrupt the time.sleep.

critias commented 1 year ago

WAIT_PERIOD_JOB_FS_SYNC is there to avoid problems with filesystem synchronisation. The problem this wait period fixes is that if Job 1 finishes on Node A writing some file and Job 2 starts immediately afterwards on Node B and tries to access this file, it often fails, since the network filesystem client on Node B doesn't know yet that the file exists. Waiting some time between the jobs fixed this. If all jobs run on the same node this shouldn't be a problem, so if the previous job and the current job run on the same node, this timeout could be skipped.

WAIT_PERIOD_BETWEEN_CHECKS is mainly there to limit how often the state of all jobs is requested from the cluster engine. Reducing it for the LocalEngine would also be no problem, but since this is implemented using sleep in the main loop it's currently not possible to only reduce it for the LocalEngine separately.

albertz commented 1 year ago

Do you have a suggestion for how to implement the logic for WAIT_PERIOD_JOB_FS_SYNC? Basically, the wait should only be skipped if it is between two jobs from the LocalEngine.

In Task.finished, there is this code:

minimal_time_since_change = 0
if not gs.SKIP_IS_FINISHED_TIMEOUT:
    minimal_time_since_change = gs.WAIT_PERIOD_JOB_FS_SYNC + gs.WAIT_PERIOD_JOB_CLEANUP

I'm not sure if it is a good idea to have the WAIT_PERIOD_JOB_FS_SYNC logic here in Task.finished. Here we cannot really implement what I suggested. I think we don't even know here what engine this is running on? I think this has to be moved to the manager.

Then, in EngineBase, there would be a new function allow_skip_job_fs_sync_wait_time or so.
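
For illustration, a minimal sketch of what such a hook could look like (allow_skip_job_fs_sync_wait_time is the name suggested above; the LocalEngine override and the manager-side helper are hypothetical, not the actual sisyphus API):

class EngineBase:
    def allow_skip_job_fs_sync_wait_time(self) -> bool:
        # Conservative default: always wait for the filesystem to sync
        # between jobs.
        return False

class LocalEngine(EngineBase):
    def allow_skip_job_fs_sync_wait_time(self) -> bool:
        # Local jobs run on the same node as the manager, so their output
        # is visible immediately and the wait can be skipped.
        return True

# Manager side (hypothetical): only wait if one of the engines involved
# does not allow skipping.
def fs_sync_wait(prev_engine, next_engine, wait_period_job_fs_sync):
    if prev_engine.allow_skip_job_fs_sync_wait_time() \
            and next_engine.allow_skip_job_fs_sync_wait_time():
        return 0
    return wait_period_job_fs_sync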

> but since this is implemented using sleep in the main loop it's currently not possible to only reduce it for the LocalEngine separately.

Of course you would then replace the time.sleep by something like threading.Event, and the local engine, once a job finishes, can interrupt the sleep.
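
For illustration, a minimal sketch of that idea (the function names and the loop structure are made up here, not the actual sisyphus manager code):

import threading

# The event replaces the plain time.sleep in the manager main loop.
wakeup = threading.Event()

def manager_check_loop(wait_period_between_checks):
    while True:
        # ... query engines and update job states here ...
        # Sleep up to WAIT_PERIOD_BETWEEN_CHECKS, but wake up early
        # as soon as the event is set.
        wakeup.wait(timeout=wait_period_between_checks)
        wakeup.clear()

def on_local_job_finished():
    # Called by the LocalEngine when one of its jobs finishes, so the
    # manager re-checks immediately instead of finishing the sleep.
    wakeup.set()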

critias commented 1 year ago

I think it's not a simple change; the code got a little convoluted over time, and the engine was never part of the equation.

The state is checked in a pull fashion: each job asks if all its input paths are available, and these paths then check whether their creator is finished or not. The engine was never meant to be part of this computation.

To get the engines involved, you would have to look at the last task of the running/finished job and the first task of the next job. Thinking about this, the Job.tasks function is meant to be called only after all inputs are available, so you should only get the first task of the next job once all inputs are available (ignoring the timeouts for local jobs). This isn't a nice task...

An alternative: A while ago Mirko and I worked on this PR when we had serious trouble with the NFS and many jobs crashed due to file sync issues: https://github.com/rwth-i6/sisyphus/commits/check-output-size/sisyphus/job.py The idea was to log the size of all output files, and the worker just checks before running the task whether all input files are available and have the right size. I guess this could also be used to just drop WAIT_PERIOD_JOB_FS_SYNC, but it would increase the number of accesses to the filesystem. It was working in my test runs, but we dropped working on it since the file system problems went away and I wasn't sure if increasing the number of filesystem accesses wouldn't cause more problems than it solved. Also it would have needed more testing to make sure it really works reliably, but if it works it would be possible to drop WAIT_PERIOD_JOB_FS_SYNC for all jobs.
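
For illustration, a rough sketch of such a check (the expected_sizes format and the helper name are invented here; the linked PR may implement it differently):

import os

def inputs_ready(expected_sizes):
    # expected_sizes: mapping from input path to the size logged by the
    # job that created it (hypothetical format).
    for path, expected_size in expected_sizes.items():
        if not os.path.exists(path):
            return False
        if os.path.getsize(path) != expected_size:
            # The file is visible but not fully synced yet.
            return False
    return True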