luator opened 1 year ago
Some thoughts on this which I quickly want to dump here, before I forget it again:
We should not try to handle the concurrent access ourselves; instead, we should use some well-established library. The only option that came to my mind so far is SQLite. SQLite is not really meant for massive concurrent write access, but I expect that for our application it should be okay. It allows only one writer at a time, so if multiple jobs want to write at the same time, they have to wait for their turn. However, writes should be relatively fast, so I don't expect the wait time to be long in practice. Even if a job needs to wait for a few seconds, this would probably be negligible compared to the total run duration of the job.
If using SQLite, something like this might be a good option: a table with columns like `job_id` and `status`. I'm not sure if the early results fit in there like this, though, so it maybe needs some modifications to really work. Maybe early results could be written to separate SQLite databases in the working directories? This would avoid concurrent write access by the jobs (each job has its own db) and should still handle concurrent write/read between job and main process.
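To illustrate, a minimal sketch of what such a status table could look like (the table and column names are just made up here; the `timeout` argument of `sqlite3.connect` makes a writer wait for the database lock instead of failing immediately with "database is locked"):

```python
import sqlite3
import time

# timeout=30 makes this connection wait up to 30 s for the write lock.
con = sqlite3.connect("cluster_run.db", timeout=30)
con.execute(
    """
    CREATE TABLE IF NOT EXISTS job_status (
        job_id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        updated_at REAL NOT NULL
    )
    """
)


def report_status(con: sqlite3.Connection, job_id: int, status: str) -> None:
    """Insert or update the status row of a single job."""
    with con:  # commits on success, rolls back on error
        con.execute(
            "INSERT INTO job_status (job_id, status, updated_at)"
            " VALUES (?, ?, ?)"
            " ON CONFLICT(job_id) DO UPDATE SET"
            "   status = excluded.status, updated_at = excluded.updated_at",
            (job_id, status, time.time()),
        )
```

The main process would then periodically `SELECT` from this table to pick up status changes.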
Note that, to my knowledge, SQLite relies on file locking to coordinate writes. In the past, file locking was disabled on some of the shared filesystems for performance reasons, so I don't think we can rely on it being available on a cluster.
I'm not sure about alternatives, though. Any file-based solution will have to somehow coordinate concurrent access, and if file locking is not available, what else is there?
One possibility could be to let each job write to separate files, with the server only ever reading from those files. This way, there cannot be concurrent writes, at least (see the sketch below). It does require more polling from the server, though.
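A rough sketch of this idea (file names and layout are purely hypothetical), using the write-to-temp-file-then-rename pattern so the server never sees a half-written file; note that `os.replace` is atomic on POSIX when source and target are on the same filesystem, though how reliably this holds on a given shared filesystem would need checking:

```python
import json
import os
from pathlib import Path


def write_status(output_dir: Path, job_id: int, status: dict) -> None:
    """Job side: atomically replace this job's own status file."""
    target = output_dir / f"{job_id}.status.json"
    tmp = output_dir / f"{job_id}.status.json.tmp"
    tmp.write_text(json.dumps(status))
    os.replace(tmp, target)  # atomic rename on the same filesystem


def poll_statuses(output_dir: Path) -> dict[int, dict]:
    """Server side: read the current status file of every job."""
    statuses = {}
    for path in output_dir.glob("*.status.json"):
        job_id = int(path.name.split(".")[0])  # "17.status.json" -> 17
        statuses[job_id] = json.loads(path.read_text())
    return statuses
```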
Another comment: my ideal would be to run both file- and network-based communication in parallel, as a form of redundancy. My experience tells me that the shared filesystems on clusters are also not to be trusted, at least in terms of performance, and I'm also not sure how fast file changes propagate when reading an updated file under load.
I am aware that coordinating both forms of communication would add significant complexity, though, so I'm not sure if we should go for it. But maybe we can at least keep both forms of communication available, using the network as a fallback.
> One possibility could be to let each job write to separate files, with the server only ever reading from those files. This way, there cannot be concurrent writes, at least. It does require more polling from the server, though.
We would still need to be careful not to read a file that is currently being written to. I actually don't know how filesystems behave in this regard, i.e. whether this is even a problem or whether the filesystem already prevents it somehow. One option might be to use empty indicator files (something like `{job_id}.started`). Just creating/deleting them should hopefully be rather atomic actions.
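A sketch of that idea, using a completion marker as a variation on the `{job_id}.started` naming (all file names here are hypothetical): the job writes its result file first and only then creates the empty marker, so the server never reads a half-written result. Whether creating the marker is truly atomic on a particular shared filesystem (e.g. NFS) would need verification.

```python
import json
from pathlib import Path


def publish_result(working_dir: Path, job_id: int, result: dict) -> None:
    """Job side: write the result first, then create the empty marker."""
    (working_dir / f"{job_id}.result.json").write_text(json.dumps(result))
    # exist_ok=False creates the marker with O_CREAT | O_EXCL, a single
    # syscall, so the marker only appears once the result is complete.
    (working_dir / f"{job_id}.finished").touch(exist_ok=False)


def collect_results(working_dir: Path) -> dict[int, dict]:
    """Server side: only read results whose marker exists."""
    results = {}
    for marker in working_dir.glob("*.finished"):
        job_id = int(marker.stem)  # "17.finished" -> 17
        result_file = working_dir / f"{job_id}.result.json"
        results[job_id] = json.loads(result_file.read_text())
    return results
```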
> Another comment: my ideal would be to run both file- and network-based communication in parallel, as a form of redundancy.
I already thought about this as well. We could keep the current network communication, at least for status updates, and in addition poll output files, but at a rather low rate. Normally, updates would be received immediately, and only if the network fails would one have to wait longer for the next file-based check. I haven't looked into it in detail yet, but I actually don't think that having both in parallel would add too much complexity.
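For what it's worth, the coordination between the two channels might boil down to a small deduplication layer. A sketch (all names made up for illustration): if each update carries a timestamp, the server can apply whichever channel delivers it first and silently drop the duplicate arriving later via the slow file-based poll.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    job_id: int
    status: str
    timestamp: float  # set by the job when the update is produced


class UpdateMerger:
    """Deduplicates updates that may arrive over both channels.

    An update is applied only if it is newer than what was already
    seen for that job, regardless of which channel delivered it.
    """

    def __init__(self) -> None:
        self._latest: dict[int, float] = {}

    def apply(self, update: StatusUpdate) -> bool:
        if update.timestamp <= self._latest.get(update.job_id, float("-inf")):
            return False  # already seen, e.g. via the network
        self._latest[update.job_id] = update.timestamp
        print(f"job {update.job_id}: {update.status}")
        return True
```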
This issue is to track changing away from network-based communication between clients and server to filesystem-based communication.
The main motivation is to make the communication more robust by relying on the shared filesystem as a safer channel. It would also allow us to remove the dependency on libuv, which is not properly maintained anymore.
This switch reaches deep into the internals of cluster utils and thus might require quite some work and testing to make sure it works correctly. Moreover, we need to think about how to actually implement communication over the filesystem properly.
A point against this change is that it makes cluster utils reliant on a shared filesystem, whereas currently only a network connection is needed.
One more anecdotal point: Michal (the original developer) once said that he should have used the filesystem for communication in the first place.
By @mseitzer (imported from GitLab)