Open vlerkin opened 2 weeks ago
Hope my feedback was at an angle that helps you at this stage. In any case, well done, keep it going!
p.s. the CI error looks like it could be caused by Kubernetes-specific code having made its way into the main API code, which wouldn't work when running with Docker.
Working on the Docker implementation to be added to this PR.
I ran into problems because I partially split this PR, and now I have a multiverse of branches that I need to refactor back into a single source of truth. That is going to take an uncertain amount of time.
The way I would do this:
If this is a lot of work spread over many commits, you may want to first do an interactive rebase of this PR to simplify it and reduce the number of commits (each of which may need amending).
Yes, this is a bit of work, but it's something I come across now and then in various projects. Sorry for the complexity!
Thank you for the advice! I was thinking of dropping the commit that merged main into this branch, then getting the code to a state where the tests pass. After that I would merge in the other branch that refactored the observer further, make the code of both branches work together, and finally check for conflicts with main and resolve those. This is a longer route than simply redoing the merge with main, but I messed up the last merge because I lost track of the changes, so gradually rebuilding this branch is easier for me.
No worries, I'm the one who messed up the merge; complexity is part of the job :D Learning to make more granular commits and cleaner PRs the hard way :D
I modified one of the scheduler methods (get_next_suspended_job_id) to handle the case where a job does not have a creation_timestamp. This is not expected, but if someone used a custom resource and forgot to add the field, or made some other error, the job gets a timestamp assigned and is processed like every other job in the queue.
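For illustration, a minimal sketch of that fallback could look like the following. The helper name `_creation_timestamp` and the dict-shaped job objects are assumptions for the example, and I'm assuming "next" means the oldest suspended job, to preserve scheduling order:

```python
from datetime import datetime, timezone

def _creation_timestamp(job):
    """Return the job's creation timestamp, falling back to 'now' (UTC)
    when the field is missing, so a malformed custom resource is still
    queued and processed like any other job."""
    ts = job.get("metadata", {}).get("creationTimestamp")
    if ts is None:
        return datetime.now(timezone.utc)
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def get_next_suspended_job_id(suspended_jobs):
    """Pick the oldest suspended job, preserving the order in which
    jobs were originally scheduled."""
    if not suspended_jobs:
        return None
    oldest = min(suspended_jobs, key=_creation_timestamp)
    return oldest["metadata"]["name"]
```

A job without a timestamp gets "now", so it naturally sorts after jobs that were queued earlier.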
Also, there are now unit tests that cover different scenarios for the scheduler.
If you have any other comments for improvements, let me know!
What happens in this PR:

The big picture: the event watcher connects to the Kubernetes API and receives a stream of events; when a new event arrives it notifies its subscribers, passing the event to the callback they provided.

The subscriber, KubernetesScheduler, receives the event in its handle_pod_event method, which reacts to changes in job statuses. If a job finished running or failed, it calls another method, check_and_unsuspend_jobs, which checks capacity and unsuspends jobs until the allowed number of parallel jobs is reached. While doing this it relies on get_next_suspended_job_id to pick which job to unsuspend next, keeping the order in which jobs were initially scheduled.

When a job is scheduled, it either runs or goes into the queue of suspended jobs (the native Kubernetes suspend mechanism), depending on the number of currently active jobs and the max_proc value from the config (default is 4). Events that change the number of active jobs then trigger the KubernetesScheduler logic, which unsuspends jobs until the desired state (the configured number of parallel jobs) is reached.