albertz opened this issue 1 year ago
But the `LocalEngine` uses the same filesystem as the others... why should this be different here? If it works, you could just change the wait periods to a lower value, no?
I don't really know why we have `WAIT_PERIOD_JOB_FS_SYNC` and co. Apparently there was some issue before? My assumption was that the sync across nodes could be delayed. But I'm certain that there is no delay within a single node, so consecutive mini jobs running on the local node will in all cases directly see the changes in the FS. That is why I suggested to have this exception for consecutive jobs on the `LocalEngine`.

And `WAIT_PERIOD_BETWEEN_CHECKS` is a different thing anyway. I don't just want to reduce that; I think that would be a bad idea for the cluster, or no? I just want local jobs to be able to interrupt the `time.sleep`.
`WAIT_PERIOD_JOB_FS_SYNC` is there to avoid problems with filesystem synchronisation. The problem this wait period fixes: if Job 1 finished on Node A writing some file, and Job 2 started immediately afterwards on Node B and tried to access this file, it often failed, since the network filesystem client on Node B didn't know yet that the file exists. Waiting some time between the jobs fixed this. If all jobs run on the same node this shouldn't be a problem, so if the previous job and the current job run on the same node this timeout could be skipped.
`WAIT_PERIOD_BETWEEN_CHECKS` is mainly there to limit how often the state of all jobs is requested from the cluster engine. Reducing it would also be no problem for the `LocalEngine`, but since this is implemented using `sleep` in the main loop, it's currently not possible to reduce it only for the `LocalEngine` separately.
Do you have a suggestion how to implement the logic for `WAIT_PERIOD_JOB_FS_SYNC`? Basically the wait should only be skipped if it is between two jobs from the `LocalEngine`.
In `Task.finished`, there is this code:

```python
minimal_time_since_change = 0
if not gs.SKIP_IS_FINISHED_TIMEOUT:
    minimal_time_since_change = gs.WAIT_PERIOD_JOB_FS_SYNC + gs.WAIT_PERIOD_JOB_CLEANUP
```
I'm not sure if it is a good idea to have the `WAIT_PERIOD_JOB_FS_SYNC` logic here in `Task.finished`. Here we cannot really implement what I suggested; I think we don't even know here what engine this is running on? I think this has to be moved to the manager.

Then, in `EngineBase`, there would be a new function `allow_skip_job_fs_sync_wait_time` or so.
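Roughly sketched (hypothetical code, just to illustrate the idea; the manager-side hook is an assumption about where this check would live):

```python
class EngineBase:
    def allow_skip_job_fs_sync_wait_time(self) -> bool:
        # Conservative default: the next job might run on a different node,
        # so the filesystem sync wait is still needed.
        return False


class LocalEngine(EngineBase):
    def allow_skip_job_fs_sync_wait_time(self) -> bool:
        # All local jobs run on the same node, so there is no FS sync delay.
        return True


# Somewhere in the manager (pseudo-code): only apply the wait if we cannot skip it.
def minimal_time_since_change(prev_engine, cur_engine):
    if prev_engine is cur_engine and cur_engine.allow_skip_job_fs_sync_wait_time():
        return 0
    return gs.WAIT_PERIOD_JOB_FS_SYNC + gs.WAIT_PERIOD_JOB_CLEANUP
```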
> but since this is implemented using `sleep` in the main loop, it's currently not possible to reduce it only for the `LocalEngine` separately.
Of course you would then replace the `time.sleep` by something like `threading.Event`, and the local engine, once a job finishes, can interrupt the sleep.
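Roughly like this (just a sketch of the mechanism, not the actual manager code; the function names are made up):

```python
import threading

wakeup = threading.Event()

def main_loop_iteration():
    # ... check job states as before ...
    # Instead of time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS):
    wakeup.wait(timeout=gs.WAIT_PERIOD_BETWEEN_CHECKS)  # can be interrupted early
    wakeup.clear()

def on_local_job_finished():
    # Called by the LocalEngine when one of its jobs finishes.
    wakeup.set()  # wake up the main loop immediately
```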
I think it's not a simple change; the code got a little convoluted over time and the engine was never part of the equation.

The state is checked in a pull fashion: each job asks if all its input paths are available, and these paths then check if their creator is finished or not. The engine was never meant to be part of the computation.
To get them you would have to get the last task of the running/finished job and the first task of the next job. Thinking about this, the `Job.tasks` function is meant to be called only after all inputs are available, so you should only get the first task of the next job if all inputs are available (ignoring the timeouts for local jobs). This isn't a nice task...
An alternative: a while ago Mirko and I worked on this PR when we had serious trouble with the NFS and many jobs crashed due to file sync issues: https://github.com/rwth-i6/sisyphus/commits/check-output-size/sisyphus/job.py

The idea was to log the size of all output files, and the worker just checks before running the task whether all input files are available and have the right size. I guess this could also be used to just drop `WAIT_PERIOD_JOB_FS_SYNC`, but it would increase the number of accesses to the filesystem.

It was working in my test runs, but we stopped working on it since the filesystem problems went away and I wasn't sure if increasing the number of filesystem accesses wouldn't cause more problems than it solved. It would also have needed more testing to make sure it really works reliably, but if it does, it would be possible to drop `WAIT_PERIOD_JOB_FS_SYNC` for all jobs.
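For illustration, the basic idea was along these lines (simplified sketch, not the actual code from that branch; the file name and helper names are made up):

```python
import os
import json

def log_output_sizes(job_dir, output_paths):
    # After a job finishes: record the size of every output file.
    sizes = {path: os.path.getsize(path) for path in output_paths}
    with open(os.path.join(job_dir, "output_sizes.json"), "w") as f:
        json.dump(sizes, f)

def inputs_ready(expected_sizes):
    # Before the worker runs a task: all inputs must exist with the logged size,
    # otherwise the NFS client has probably not caught up yet; wait and retry.
    for path, size in expected_sizes.items():
        if not os.path.exists(path) or os.path.getsize(path) != size:
            return False
    return True
```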
My recognition + scoring + summarizing results pipeline contains a number of very simple mini jobs which run one after another. The wait time after a mini job is annoying. I don't want to have any wait time between two consecutive mini jobs.

I'm not sure how others think about this (excluding those who just don't care). What are the reasons for having this wait time? I'm specifically talking about mini jobs, specifically jobs run via `LocalEngine`.

I haven't looked too deep into the code yet. Some things I consider changing:

- Detect consecutive jobs on the `LocalEngine` and ignore `WAIT_PERIOD_JOB_FS_SYNC`, `WAIT_PERIOD_JOB_CLEANUP` and some others for this.
- Interrupt the `time.sleep(gs.WAIT_PERIOD_BETWEEN_CHECKS)` when some `LocalEngine` job finishes.

Maybe such changes are not wanted by everyone, so then this should be optional and configurable somehow?
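For example, this could be guarded by a flag in settings.py (hypothetical option name, not an existing sisyphus setting), so the current behaviour stays the default:

```python
# settings.py
# Hypothetical flag: skip WAIT_PERIOD_JOB_FS_SYNC / WAIT_PERIOD_JOB_CLEANUP and
# interrupt WAIT_PERIOD_BETWEEN_CHECKS when consecutive jobs run via the LocalEngine.
SKIP_WAIT_FOR_CONSECUTIVE_LOCAL_JOBS = True
```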