kernelci / kernelci-project

KernelCI Linux Foundation project documentation

KCIv2 node state changes proposal #309

Open nuclearcat opened 5 months ago

nuclearcat commented 5 months ago

state.md

I have a suggestion for the Node state values, but this is not a priority; it is more of an improvement than a bug fix.

First, we need to discuss what the point of the "closing" state is, and why nodes don't go directly to "done". If anybody has an idea, please share it. In fact, some pipeline services already set the state directly to "done".

My proposal

Right now it is:

class StateValues(str, enum.Enum):
    """Enumeration to declare values to be used for Node.state"""

    RUNNING = 'running'
    AVAILABLE = 'available'
    CLOSING = 'closing'
    DONE = 'done'

Proposing to change it to the following:

class StateValues(str, enum.Enum):
    """Enumeration to declare values to be used for Node.state"""

    # When the node is just allocated but no real work has been done yet
    AVAILABLE = 'available'
    # When the node is submitted to the queue; set by the scheduler when a
    # pipeline service submits the node task to the runtime
    SUBMITTED = 'submitted'
    # Set by the runtime task itself at the beginning of the task.
    # This is the important part: with k8s, for example, a node might stay
    # in a pending state for a long time or even be discarded due to k8s
    # issues. It might help even with Docker, for example when a node is
    # submitted to containerd but contains an error in its code, so it
    # never ran. With LAVA we might not be able to see the node transition
    # at first (polling will be needed), so in some cases a node might go
    # directly from available to running.
    RUNNING = 'running'
    # When the node is done and the final results are received
    DONE = 'done'

Reasoning:

1) It will allow us to measure queueing time, running time, and completion time, which is important for statistics.
2) If a node hangs in some state, it will be easier to debug where it is stuck. For example, a node submitted to a particular cluster that then times out means the cluster is overloaded or malfunctioning. (A sketch of how such timings could be derived is shown below.)
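For illustration, here is a minimal sketch of how those statistics could be computed once the intermediate states exist. The (state, timestamp) transition records are a hypothetical structure for this example, not part of the actual kernelci-api schema:

from datetime import datetime, timedelta

def state_durations(transitions):
    """Compute how long a node spent in each state.

    transitions is a chronological list of (state, datetime) tuples,
    e.g. [('available', t0), ('submitted', t1), ...].
    """
    durations = {}
    for (state, start), (_, end) in zip(transitions, transitions[1:]):
        durations[state] = durations.get(state, timedelta()) + (end - start)
    return durations

# Example: queueing time is the time spent in 'submitted'
transitions = [
    ('available', datetime(2024, 1, 1, 12, 0)),
    ('submitted', datetime(2024, 1, 1, 12, 1)),
    ('running', datetime(2024, 1, 1, 12, 10)),
    ('done', datetime(2024, 1, 1, 12, 40)),
]
durations = state_durations(transitions)
print(durations['submitted'])  # 0:09:00 -> queueing time
print(durations['running'])    # 0:30:00 -> running time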

gctucker commented 5 months ago

Out of interest, what is KCIv2? Is that basically the new API?

gctucker commented 5 months ago

There's a slight issue with the vocabulary you're using here: a node itself is not "running"; jobs are running. Nodes are just bits of data that contain results and information about the state of the pipeline. Also, making assumptions about what the pipeline might be doing is likely to lead to corner cases in the API and issues in the state machine design. It should be entirely separate, as anything can use the API and submit data with all kinds of workflows.
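To make the corner-case concern concrete, a state machine is typically defined by an explicit transition table, and every added state grows that table and the set of invalid paths that must be handled. The sketch below is purely illustrative and not the actual API design:

import enum

class StateValues(str, enum.Enum):
    AVAILABLE = 'available'
    SUBMITTED = 'submitted'
    RUNNING = 'running'
    DONE = 'done'

# Illustrative transition table, including the LAVA corner case from the
# proposal where a node may go directly from 'available' to 'running'.
ALLOWED_TRANSITIONS = {
    StateValues.AVAILABLE: {StateValues.SUBMITTED, StateValues.RUNNING},
    StateValues.SUBMITTED: {StateValues.RUNNING, StateValues.DONE},
    StateValues.RUNNING: {StateValues.DONE},
    StateValues.DONE: set(),
}

def validate_transition(old, new):
    """Reject state changes that the table does not allow."""
    if new not in ALLOWED_TRANSITIONS[old]:
        raise ValueError(f"invalid node state transition: {old} -> {new}")

validate_transition(StateValues.AVAILABLE, StateValues.RUNNING)  # OK (LAVA case)
# validate_transition(StateValues.DONE, StateValues.RUNNING)     # would raise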

What are the issues you've hit with the current state machine that led you to propose an alternative?

nuclearcat commented 5 months ago

Out of interest, what is KCIv2? Is that basically the new API?

As discussed on IRC, for the sake of simplicity I unofficially (for now) call the software stack used for the new generation of KernelCI "KCIv2" or "KernelCI v2".

What are the issues you've hit with the current state machine that led you to propose an alternative?

I provided this in the first paragraph of the document, under "Reasoning:". To expand on the first example: we sometimes have delays, and in certain cases it was a k8s scheduling delay due to unavailable resources, but in other cases the k8s node itself was misbehaving and a kernel that is supposed to compile within 10-20 minutes took more than 60 minutes. As a second example, there were cases when nodes were evicted, probably because they are allocated as "Spot" instances, and the scheduler was not aware of why the task was not completed or at which stage.
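As a hedged sketch of that debugging benefit: with a distinct "submitted" state, a watchdog could flag nodes stuck between submission and execution. The node dictionaries and the threshold values are made up for this example and are not the actual kernelci-api schema:

from datetime import datetime, timedelta

# Hypothetical per-state timeouts; the values are illustrative only.
STUCK_THRESHOLDS = {
    'submitted': timedelta(minutes=15),  # likely cluster overload or malfunction
    'running': timedelta(minutes=60),    # e.g. an evicted Spot instance
}

def find_stuck_nodes(nodes, now=None):
    """Return nodes that have stayed in one state longer than expected."""
    now = now or datetime.utcnow()
    stuck = []
    for node in nodes:
        limit = STUCK_THRESHOLDS.get(node['state'])
        if limit and now - node['updated'] > limit:
            stuck.append(node)
    return stuck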

gctucker commented 5 months ago

Out of interest, what is KCIv2? Is that basically the new API?

As discussed on IRC, for the sake of simplicity I unofficially (for now) call the software stack used for the new generation of KernelCI "KCIv2" or "KernelCI v2".

OK, but issues like this one are public, so being consistent with naming is still important here. Actually, just moving this issue to the kernelci-api project and calling it "node state change" would work just fine.

What are the issues you've hit with the current state machine that led you to propose an alternative?

I provided this in the first paragraph of the document, under "Reasoning:". To expand on the first example: we sometimes have delays, and in certain cases it was a k8s scheduling delay due to unavailable resources, but in other cases the k8s node itself was misbehaving and a kernel that is supposed to compile within 10-20 minutes took more than 60 minutes. As a second example, there were cases when nodes were evicted, probably because they are allocated as "Spot" instances, and the scheduler was not aware of why the task was not completed or at which stage.

Sorry, I don't understand how this relates to the proposed change in node states. I was merely curious; I don't really want to discuss it in detail here, but I think others should. Node states are a critical part of how the whole modular pipeline gets orchestrated, and a lot of thought has already gone into this, so I would advise being very careful when making changes there.