q-m / scrapyd-k8s

Scrapyd on container infrastructure

Scheduling #6

Open wvengen opened 11 months ago

wvengen commented 11 months ago

scrapyd has scheduling, while this project starts a job immediately when a spider is scheduled. The idea is to start Kubernetes jobs suspended, and unsuspend them when they can start running.

Follow the way scrapyd is configured to configure this.

wvengen commented 10 months ago

For Kubernetes, jobs can be started suspended, and the scheduler can unsuspend them. For Docker, containers can be created with create instead of run, and the scheduler can start them.
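A minimal sketch of the Kubernetes side, assuming the kubernetes Python client; the namespace, labels, and image here are illustrative, not the project's actual names:

```python
# Hypothetical sketch: create a Job with spec.suspend=True, so no pod is
# started until a scheduler unsuspends it later.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="scrapyd-job-example",
        labels={"app": "scrapyd-k8s"},  # label used to find these jobs later
    ),
    spec=client.V1JobSpec(
        suspend=True,  # start suspended; the scheduler flips this to False
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(name="spider", image="example/spider:latest"),
                ],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```

For Docker, the analogue would be `docker create` followed later by `docker start` (with docker-py, `client.containers.create(...)` and then `container.start()`).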

wvengen commented 3 months ago

Note that #28 may turn out to have a part that watches Kubernetes events. It is likely that the scheduling feature needs this as well (like, when a job finishes, a new job can be started). It could make sense to share some of the watching setup (so that there are not multiple connections to Kubernetes needed to listen for events, if that is cleanly possible).

vlerkin commented 1 month ago

Hi Willem, just want to ask if I understand the idea correctly. We want to unsuspend a job when we have enough cluster capacity to complete it, right? I am not sure the pod watcher I added is helpful here, since we would only need to monitor idle cpu/memory. If you had something else in mind, please share.

wvengen commented 1 month ago

Well, this is about the scrapyd idea of scheduling: run at most a maximum number of jobs in parallel. We're not talking about cluster capacity here (that is handled automatically by Kubernetes).

So here we'd want to start jobs suspended, and have a scheduler loop unsuspend jobs when the number of currently running jobs is lower than the maximum. See max_proc and max_proc_per_cpu (though we don't have to follow this exactly, let's start with just max_proc).
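A rough sketch of one pass of such a scheduler loop, assuming jobs carry an `app=scrapyd-k8s` label and were created with `spec.suspend=True` (the label, namespace, and config constant are illustrative, not the project's actual names):

```python
# Hypothetical scheduler pass: unsuspend waiting jobs until max_proc is reached.
from kubernetes import client

MAX_PROC = 4  # would come from the scrapyd-style config option max_proc

def schedule_pass(batch: client.BatchV1Api, namespace: str = "default") -> None:
    jobs = batch.list_namespaced_job(
        namespace, label_selector="app=scrapyd-k8s"
    ).items

    # Approximation for the sketch: an unsuspended, not-yet-completed job
    # counts as running; suspended jobs are waiting to be started.
    running = [j for j in jobs if not j.spec.suspend and not j.status.completion_time]
    suspended = [j for j in jobs if j.spec.suspend]

    # Unsuspend waiting jobs until the running count reaches the maximum.
    for job in suspended[: max(0, MAX_PROC - len(running))]:
        batch.patch_namespaced_job(
            name=job.metadata.name,
            namespace=namespace,
            body={"spec": {"suspend": False}},
        )
```

Such a pass could be re-run whenever a job finishes (e.g. triggered by a watch event), which is where sharing the event watching from #28 could come in.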

p.s. you don't need to tackle this together with the log handling. Only when setting up the Kubernetes watcher, it may be useful to realise that another component may want to use watching as well.

vlerkin commented 1 month ago

I think it's good to think about this ticket, just in case I need a watcher and have to refactor the code of the connected issue.

vlerkin commented 1 month ago

Could you please also elaborate a bit on what you expect from max_proc_per_cpu and its usage in this task? Do you want to override the resources defined in the yaml for each job, so that we decide what share of cpu to assign to a process in order to run x processes in parallel? Let's say we have 1 cpu and max_proc_per_cpu = 5; then I would need to assign 0.2 cpu to each job.

Another point: I think we need a master process that controls job state. It would be a separate type of watcher, not connected to the one in the log handling ticket (that one is optional and not active by default, and we don't want to couple things with different responsibilities). Does that make sense?

wvengen commented 1 month ago

The idea is to limit the number of concurrently running jobs run by scrapyd-k8s: no more than max_proc should be running simultaneously.

For max_proc_per_cpu we might limit the number of jobs to the number of cpus in the cluster, but it's less useful here: Kubernetes has its own load handling approaches. Let's start with max_proc and then see what other tunables could be useful. Requests and limits are much more useful here; max_proc would then allow us to not fill the cluster with running jobs.
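For illustration only (this is already implemented in scrapyd-k8s, see the next comment): requests and limits bound what each job may use, while max_proc bounds how many jobs run at once. With the kubernetes Python client, a per-container resource spec might look like this (values are made up):

```python
# Illustrative per-spider resource requests/limits, not the project's defaults.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "200m", "memory": "256Mi"},  # what the scheduler reserves
    limits={"cpu": "1", "memory": "512Mi"},       # hard cap for the spider pod
)
container = client.V1Container(
    name="spider", image="example/spider:latest", resources=resources
)
```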

wvengen commented 1 month ago

Note that setting requests and limits (for memory and cpu) is already implemented, no need to worry about that here.

Ok, if controlling job state is a different kind of watcher, then you can ignore it for the purpose of log handling (though I seem to remember that log handling also involved watching for job changes, but I may be wrong).

vlerkin commented 1 month ago

Yes, it watches all pods in the cluster and selects the ones with specific labels and status.

vlerkin commented 1 month ago

On the other hand, we might build some sort of publisher/subscriber model, where the pod watcher is the publisher: once it notices specific changes, for example in a status, it sends a message to a specific subscriber that activates log watching, or to another one that reacts to some other change in the pods.
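A rough sketch of that idea, assuming a single watch connection on pods with an illustrative `app=scrapyd-k8s` label; the subscriber names at the bottom are made up:

```python
# Hypothetical pod-event bus: one Kubernetes watch, many subscribers.
from collections import defaultdict
from kubernetes import client, watch

class PodEventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        # event_type is the watch event type: "ADDED", "MODIFIED", "DELETED"
        self._subscribers[event_type].append(handler)

    def run(self, core: client.CoreV1Api, namespace: str = "default"):
        w = watch.Watch()
        for event in w.stream(
            core.list_namespaced_pod,
            namespace,
            label_selector="app=scrapyd-k8s",
        ):
            for handler in self._subscribers[event["type"]]:
                handler(event["object"])  # the changed V1Pod

# bus = PodEventBus()
# bus.subscribe("MODIFIED", start_log_watch)       # log-handling subscriber
# bus.subscribe("MODIFIED", maybe_unsuspend_next)  # scheduling subscriber
```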

As you said, it's better to finish the log PR as it is now; then, as part of this issue, I can think about extracting that watcher/publisher logic.