Closed by luator 4 months ago
I tested by running some of the examples. It seems to work well, but it's hard to tell for sure. Do you know of a way to test this more systematically?
I also think that UDP is not ideal. To my understanding, it can happen that packets are lost and we wouldn't know. I don't know how big of a problem this actually is in practice, though. Anyway, if it's possible to switch easily, I'd prefer some more reliable protocol. I'll try to understand what changes would be required to switch to TCP.
File-based communication might be tricky with concurrency, so I'd only try this if there is a reliable library which handles this (sqlite might be an option, not sure how well it performs if there is lots of concurrent write access). It also has the disadvantage that the server needs to poll for updates in this case, but that's maybe a minor issue.
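To make the polling trade-off concrete, here is a rough sketch of what an sqlite-based channel could look like. It is not something cluster_utils implements; the table layout and the `send_message` / `poll_messages` helpers are made up for illustration, and how well this behaves under heavy concurrent writes on a cluster file system is exactly the open question raised above.

```python
import sqlite3


def send_message(db_path, body):
    """Client side (hypothetical helper): append one message to the shared database."""
    conn = sqlite3.connect(db_path, timeout=10)  # wait up to 10 s if the file is locked
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS messages "
                "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)"
            )
            conn.execute("INSERT INTO messages (body) VALUES (?)", (body,))
    finally:
        conn.close()


def poll_messages(db_path, last_id=0):
    """Server side (hypothetical helper): return (id, body) rows newer than last_id."""
    conn = sqlite3.connect(db_path, timeout=10)
    try:
        rows = conn.execute(
            "SELECT id, body FROM messages WHERE id > ? ORDER BY id", (last_id,)
        ).fetchall()
    finally:
        conn.close()
    return rows
```

The server would call `poll_messages` in a loop, remembering the highest id it has seen, which is the polling overhead mentioned above.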
I changed the implementation to use TCP now (based on the example from the asyncio documentation). At least locally it was working; I will test on the cluster now.
Galvani seems to be overloaded at the moment (>800 jobs pending), so I can't test there, but I did some runs with the example scripts on the MPI cluster and everything seems to work.
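For reference, a TCP variant along the lines of the asyncio streams example might look roughly like the sketch below. This is not the code from this PR; the handler, the message content, and the one-message-per-connection framing are assumptions chosen to keep the example short.

```python
import asyncio


async def tcp_demo():
    """Start a local TCP server, send it one message, and return what it received."""
    received = []

    async def handle_client(reader, writer):
        # Assumption: one message per connection, terminated by EOF.
        data = await reader.read()
        received.append(data.decode())
        writer.close()
        await writer.wait_closed()

    # Port 0 lets the OS pick a free port.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"job 1 finished")
    await writer.drain()
    writer.write_eof()   # signal end of message
    await reader.read()  # returns once the server has closed the connection
    writer.close()
    await writer.wait_closed()

    server.close()
    await server.wait_closed()
    return received
```

Unlike UDP, the client here gets an error if the connection cannot be established, at the cost of a connection setup per message.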
Now finally managed to test on Galvani as well. A run with 1000 jobs passed without failures.
Note: Last pushes just rebased on master and swapped last two commits (so I can more easily switch between TCP and UDP for testing), so no actual code changes to review.
I dropped the last commit to go back to UDP as discussed above.
Python's asyncio provides everything we need to set up a simple UDP server. Use it instead of pyuv to get rid of an unnecessary third-party dependency.
This also has the benefit that we avoid a direct dependency on a git repository, which is blocking us from publishing cluster_utils to PyPI.
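As a rough illustration of the approach described above (not the actual cluster_utils code), a UDP server built only on asyncio can be sketched like this. The `MessageProtocol` class and the JSON message format are hypothetical and only serve the example.

```python
import asyncio
import json


class MessageProtocol(asyncio.DatagramProtocol):
    """Hypothetical protocol: stores every JSON datagram it receives."""

    def __init__(self, received):
        self.received = received

    def datagram_received(self, data, addr):
        self.received.append(json.loads(data.decode()))


async def udp_demo():
    """Start a local UDP server, send it one datagram, and return what it received."""
    loop = asyncio.get_running_loop()
    received = []

    # Port 0 lets the OS pick a free port.
    server_transport, _ = await loop.create_datagram_endpoint(
        lambda: MessageProtocol(received), local_addr=("127.0.0.1", 0)
    )
    host, port = server_transport.get_extra_info("sockname")[:2]

    client_transport, _ = await loop.create_datagram_endpoint(
        asyncio.DatagramProtocol, remote_addr=(host, port)
    )
    client_transport.sendto(json.dumps({"job_id": 1, "status": "done"}).encode())

    # UDP gives no delivery confirmation, so wait briefly for the datagram.
    for _ in range(100):
        if received:
            break
        await asyncio.sleep(0.01)

    client_transport.close()
    server_transport.close()
    return received
```

`loop.create_datagram_endpoint` and `asyncio.DatagramProtocol` are part of the standard library, so no third-party event loop binding is needed for this.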
Two additional minor changes:
MinJob class.
mark_failed_jobs if there are no jobs to check.

Fixes #80
I'll do a bit more testing but so far it looks good, so I think it's already ready for review.