ubiquibot / plugins-wishlist


failed workflow catcher #21

Open · Keyrxng opened this issue 2 weeks ago

Keyrxng commented 2 weeks ago

See this workflow run regarding Build CI failing during the yarn step. I've come across this while testing plugin workflows: the workflow fails, for reasons outside the user's control, during a step that shouldn't fail, such as yarn.

It's reasonable to expect this to happen again for plugin workflows, and the kernel wouldn't know the workflow has failed (afaik).


The events to work with are listed below. I think it's possible to track failed workflow steps by name, specifically steps that are expected to never fail. If so, a plugin that automatically re-fires these runs would be beneficial; a sketch of the idea follows the event list.

```
workflow_dispatch: [],
workflow_job: [],
'workflow_job.completed': [],
'workflow_job.in_progress': [],
'workflow_job.queued': [],
'workflow_job.waiting': [],
workflow_run: [],
'workflow_run.completed': [],
'workflow_run.in_progress': [],
'workflow_run.requested': []
```
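
To make the idea concrete, here is a minimal sketch of a handler for the workflow_run.completed event that inspects the failed jobs for steps that are expected to never fail and re-fires only the failed jobs through the GitHub REST API. The NEVER_FAIL_STEPS list, the event shape, and the Octokit setup are assumptions for illustration, not the kernel's actual plugin interface.

```typescript
import { Octokit } from "@octokit/rest";

// Step names we treat as "should never fail"; assumed for illustration.
const NEVER_FAIL_STEPS = ["yarn", "yarn install", "Setup Node"];

// Minimal shape of the webhook payload fields used below.
interface WorkflowRunCompletedEvent {
  repository: { owner: { login: string }; name: string };
  workflow_run: { id: number; conclusion: string | null };
}

export async function handleWorkflowRunCompleted(event: WorkflowRunCompletedEvent, token: string): Promise<void> {
  if (event.workflow_run.conclusion !== "failure") return;

  const octokit = new Octokit({ auth: token });
  const owner = event.repository.owner.login;
  const repo = event.repository.name;
  const run_id = event.workflow_run.id;

  // Check whether any failed step is one we never expect to fail (e.g. yarn install hitting a 502).
  const { data } = await octokit.rest.actions.listJobsForWorkflowRun({ owner, repo, run_id, filter: "latest" });
  const transientFailure = data.jobs.some((job) =>
    job.steps?.some((step) => step.conclusion === "failure" && NEVER_FAIL_STEPS.includes(step.name))
  );

  if (transientFailure) {
    // Re-run only the failed jobs of this run, not the whole workflow.
    await octokit.rest.actions.reRunWorkflowFailedJobs({ owner, repo, run_id });
  }
}
```

Re-running only the failed jobs keeps the retry cheap and avoids repeating jobs that already passed.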

I feel I lack the insight to properly define the spec for this one, if it is possible at all given the kernel structure.

0x4007 commented 2 weeks ago

You can output the error log to ChatGPT and it can determine whether to rerun or not.

502 bad gateway on a yarn install seems like a good candidate for a rerun but we must prevent infinite loops.

Keyrxng commented 2 weeks ago

> You can output the error log to ChatGPT and it can determine whether to rerun or not.
>
> 502 bad gateway on a yarn install seems like a good candidate for a rerun

It would need to be a worker to respond quickly, and if that's the case then why not leverage Cloudflare AI in the worker itself to make the decision? It's likely faster than calling GPT and waiting for a response.

CF AI has a couple of decent models, such as llama3 or mistral, that I think would be capable enough for this task.

https://developers.cloudflare.com/workers-ai/models/
https://developers.cloudflare.com/workers-ai/platform/pricing
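
As a rough illustration, the decision step inside a Worker could look like the sketch below, assuming an AI binding configured in wrangler.toml and one of the llama3 models from the catalogue linked above; the prompt and the RERUN/SKIP convention are made up for this example.

```typescript
// Assumes an `AI` binding declared in wrangler.toml ([ai] binding = "AI").
export interface Env {
  AI: { run(model: string, input: unknown): Promise<{ response?: string }> };
}

// Ask the model whether a failed run looks transient (e.g. a 502 during yarn install).
export async function shouldRerun(env: Env, errorLog: string): Promise<boolean> {
  const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
    messages: [
      {
        role: "system",
        content:
          "You triage CI failures. Reply RERUN if the failure looks transient (network error, 502, registry timeout), otherwise reply SKIP.",
      },
      // Truncate the log so the prompt stays within the model's context window.
      { role: "user", content: errorLog.slice(0, 4000) },
    ],
  });
  return (result.response ?? "").toUpperCase().includes("RERUN");
}
```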

> but we must prevent infinite loops.

We could leverage KV here and track what runs get refired and then limit it that way?
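
A minimal sketch of that KV idea, assuming a KV namespace bound to the Worker and an arbitrary cap of two automatic re-fires per workflow run (the KVNamespace type comes from @cloudflare/workers-types):

```typescript
// Arbitrary cap on automatic re-fires per run; an assumption for this sketch.
const MAX_RERUNS = 2;

export async function canRerun(kv: KVNamespace, runId: number): Promise<boolean> {
  const key = `rerun:${runId}`;
  const count = Number((await kv.get(key)) ?? "0");
  if (count >= MAX_RERUNS) return false;

  // Record the attempt; expire the key after a day so old runs don't accumulate.
  await kv.put(key, String(count + 1), { expirationTtl: 86400 });
  return true;
}
```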

0x4007 commented 2 weeks ago

> It would need to be a worker to respond quickly

I don't see why timing matters much here aside from developer convenience.

I am not opposed to using Cloudflare Workers AI, just as long as we retain the same kernel-plugin architecture we are already building (which supports GitHub Actions or Cloudflare Worker runtime environments).

> We could leverage KV here and track what runs get refired and then limit it that way?

You can simplify by passing in a request header that says which run it is, i.e. X-UBIQUITY-RERUN.

In the runtime logic you can check for this value and terminate as needed.
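
A sketch of that check at the top of a Worker's fetch handler, assuming the X-UBIQUITY-RERUN header carries the current rerun attempt count (the header semantics and the cap of two are assumptions):

```typescript
// Assumed cap on automatic reruns before the plugin terminates instead of re-firing.
const MAX_RERUNS = 2;

export default {
  async fetch(request: Request): Promise<Response> {
    // The kernel is assumed to set X-UBIQUITY-RERUN to the current rerun attempt (absent on the first run).
    const attempt = Number(request.headers.get("X-UBIQUITY-RERUN") ?? "0");
    if (attempt >= MAX_RERUNS) {
      // Terminate here to prevent an infinite rerun loop.
      return new Response("Rerun limit reached; not re-firing.", { status: 200 });
    }
    // ...otherwise continue with the normal plugin logic (decide whether to re-fire the failed run).
    return new Response("OK");
  },
};
```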