isi-vista / adam

Abduction to Demonstrate an Articulate Machine
MIT License

Pipeline can fail if segmentation server already running #1199

Open spigo900 opened 1 year ago

spigo900 commented 1 year ago

Currently, when object segmentation is enabled, the pipeline script sets up dependencies between jobs so that everything fails if the segmentation server isn't able to bind to a port. Unfortunately, this means that if you run multiple experiments with STEGO in parallel, they trip over each other at the server step. One of the experiments will start the server successfully and all of the others will fail to. The successful experiment will then cancel the server as soon as it is finished with it, even though other jobs may still need the server (and may be actively using it).

This is a problem because it makes it harder to run such experiments in parallel.

A dumb solution is to just apply --dependency=singleton to the server job. This should prevent multiple "start server" jobs from running at one time. However, this has the disadvantage that the entire rest of the experiment has to wait on "that experiment's" server start job to run, meaning you can't interleave jobs from different experiments even though in theory their segmentation jobs could share a server.
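
For concreteness, here's a minimal sketch of what applying the flag could look like if the pipeline submits the server job with sbatch from Python. The job name and script path are invented for illustration; the actual pipeline script may build the command differently.

```python
# Hypothetical sketch only: submit the segmentation-server job with
# --dependency=singleton so that only one job with this name (per user)
# runs at a time. The job name and script path are placeholders.
import subprocess


def submit_server_job(script_path: str = "start_segmentation_server.sh") -> str:
    result = subprocess.run(
        [
            "sbatch",
            "--job-name=stego-segmentation-server",
            "--dependency=singleton",  # wait for earlier jobs with the same name/user to finish
            "--parsable",  # print only the job ID on stdout
            script_path,
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()  # Slurm job ID of the submitted server job
```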

This dumb solution is currently the best I've thought of, and I think it's good enough for now. Below I've sketched some (messy) thoughts on what a better solution would need; based on the changeover issue discussed there, I don't think it's worth it.

Messy thoughts on a better solution

Here are some thoughts about what we'd need for a better solution:

  1. A smarter solution would first have to check whether the server is reachable, not whether a particular job is running.
    1. This should be relatively straightforward: rather than having the segmentation job depend on the server's Slurm job, have it curl the server in a loop, sleeping for a minute or two after each failure and retrying up to, say, 5 times (sketched after this list).
  2. But second, we probably don't want to run the server 24/7, so we need to clean it up somehow.
    1. Waiting on the Slurm timeout (23 hours) would be fine, except that you may have queued up, e.g., 8 server jobs if you're launching 8 parallel experiments, which means it'll be a long time before those jobs all collectively run and time out.
    2. Here's a hacky idea: we could queue an scancel job delayed to run in 24 hours (using --begin=now+24hours) that cancels all currently queued server jobs (as gathered using squeue). This might work OK combined with the singleton option, since that way we're guaranteed at least one server should be running. (Also sketched after this list.)
  3. But no matter how we clean up the server, at some point one server job must end and another must start. This is likely to be a pain if a segmentation job is running during the "change-over" between server jobs.
    1. If we really wanted to do this, we'd have to modify the Python script that does segmentation so that it can handle the server going down for at least a few minutes, and potentially longer, depending on how lucky we are with Slurm allocations (also touched on in the sketch after this list).
    2. Meanwhile, I am reasonably confident we don't have to worry about changeover if the server job started around the same time as the segmentation job, as it would have to if we're just adding singleton on top of the existing setup. The M5 objects train curriculum is 600 images and only takes 6 hours, and I think the new curricula we're expecting are no larger than that. So we should be well within the time limit for the server job.
    3. So: Probably the singleton approach is best for now.
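
To make points 1.1 and 3.1 concrete, here's a rough sketch of what the reachability check and client-side retry could look like on the segmentation side. The server URL, timeouts, retry counts, and sleep times are placeholders, not values from the actual pipeline.

```python
# Rough sketch, not the pipeline's actual code: poll the segmentation server
# until it is reachable, and retry individual requests so a brief outage
# (e.g. during a changeover between server jobs) doesn't kill the job.
# The URL and timing parameters below are invented for illustration.
import time
import urllib.error
import urllib.request

SERVER_URL = "http://localhost:5000"  # hypothetical server address


def wait_for_server(url: str = SERVER_URL, attempts: int = 5, delay_s: int = 120) -> bool:
    """Return True once the server answers, polling up to `attempts` times."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10):
                return True
        except (urllib.error.URLError, OSError):
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False


def post_with_retry(url: str, data: bytes, attempts: int = 5, delay_s: int = 60) -> bytes:
    """Send one request, tolerating the server being down for a few minutes."""
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, data=data, timeout=60) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as error:
            last_error = error
            time.sleep(delay_s)
    raise RuntimeError(f"Segmentation server unreachable after {attempts} attempts") from last_error
```

A segmentation job would call wait_for_server once at startup in place of the Slurm job dependency, and send its requests through post_with_retry so a changeover doesn't kill it mid-run.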
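
The delayed clean-up from point 2.2 could look something like the following. This sketch uses scancel's own --name/--state filters rather than parsing squeue output, and the job names are again invented for illustration.

```python
# Rough sketch of the delayed clean-up idea: queue a job that starts roughly
# 24 hours from now and cancels any still-pending server jobs, on the
# assumption that singleton guarantees the one running server stays up until
# its own time limit. The job names are placeholders.
import subprocess

SERVER_JOB_NAME = "stego-segmentation-server"  # hypothetical job name


def queue_delayed_cleanup() -> None:
    # --wrap submits a one-line shell command as a batch job;
    # --begin=now+24hours delays it so currently running experiments are unaffected.
    cleanup_command = f"scancel --name={SERVER_JOB_NAME} --state=PENDING"
    subprocess.run(
        [
            "sbatch",
            "--begin=now+24hours",
            "--job-name=segmentation-server-cleanup",
            f"--wrap={cleanup_command}",
        ],
        check=True,
    )
```
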
lichtefeld commented 1 year ago

I agree the singleton is currently our best approach.