Infinite loop using wildcard event

gentlementlegen commented 1 month ago

Recently we suffered from infinite loops in the kernel combined with actions. After investigating, it seems that the combination of the wildcard event ('*') with an Action was the cause. What happens is the following:

the kernel receives a webhook event, say a push event
the kernel dispatches that event to plugins located within '*'
the action starts, say user-activity-watcher
this summons the Github webhook that sends a workflow_started event to the kernel
kernel dispatches that event to '*'
action starts again

and so on, creating the loop. It doesn't happen with Workers because they don't trigger any event within Github when they are run.

I think this highlights the dangers of subscribing Actions to that wildcard. Maybe we should consider removing it and changing the way we configure plugins. My idea was to have a allow | forbid list where we can configure which events to react to, or which events to ignore, per plugin in the configuration file. Now that we are working on the manifest feature, we could even allow plugin creators to set some defaults there, that we can eventually override within the configuration.

However despite these changes we would not be safe from infinite loops (for example, a plugin posting a comment when the event comment posting is triggered). We should think of a way to limit runs if that happens, to avoid bursting through cloudflare limits and github api.

@0x4007 @whilefoo @rndquu for vis

0x4007 commented 1 month ago

Recently we suffered from infinite loops in the kernel combined with actions. After investigating, it seems that the combination of the wildcard event ('*') with an Action was the cause. What happens is the following:

the kernel receives a webhook event, say a push event

the kernel dispatches that event to plugins located within '*'

the action starts, say user-activity-watcher

this summons the Github webhook that sends a workflow_started event to the kernel

kernel dispatches that event to '*'

action starts again

and so on, creating the loop. It doesn't happen with Workers because they don't trigger any event within Github when they are run.

I think this highlights the dangers of subscribing Actions to that wildcard. Maybe we should consider removing it and changing the way we configure plugins.

I feel like the most straightforward solution would be to include metadata in the invocation request from where is the event source. We can seek out and reject requests that are recursive.

My idea was to have a allow | forbid list where we can configure which events to react to, or which events to ignore, per plugin in the configuration file. Now that we are working on the manifest feature, we could even allow plugin creators to set some defaults there, that we can eventually override within the configuration.

I don't think overriding partner configs is a good idea, however setting defaults so that it's easier for partners to set up could be useful.

Perhaps the best solution in the future is a UI that partners can use to edit their yml configs. The UI will handle all the yml editing. We will abstract the text of the yml and instead make it sort of a form with text inputs and a save button. It can automatically prefill the form with the recommended config.

From this UI we can load default/recommended configs from the plugins.

However despite these changes we would not be safe from infinite loops (for example, a plugin posting a comment when the event comment posting is triggered). We should think of a way to limit runs if that happens, to avoid bursting through cloudflare limits and github api.

This makes me go back to my first point. I am unsure of the implementation details but it seems sensible that we eliminate recursive calls by checking the "invocation source" of the current request.

@0x4007 @whilefoo @rndquu for vis

rndquu commented 1 month ago

@gentlementlegen Why do we even need the * wildcard event?

We should think of a way to limit runs if that happens, to avoid bursting through cloudflare limits and github api.

Yes, as 0x4007 stated, there must be a way to set metadata (i.e. invocation id) on plugin invocation which will be used later to reject subsequent recursive calls.

0x4007 commented 1 month ago

@gentlementlegen Why do we even need the * wildcard event?

The original idea was to invoke the unassignment for neglected tasks. For example: there is a task that had no activity for seven days, so the bot should check every assigned issue in an organization if any event happens inside of the organization.

I anticipate that we will have many organizations with little to no activity. This approach saves a tremendous amount on compute compared to cron etc.

Given that issues_comment.created is the most frequent event, perhaps, if we must choose a single event, we can piggy back this plugin off of any comment across any repository within an organization. Then the bot will check every assigned issue for its last activity time across the entire organization.

gentlementlegen commented 1 month ago

@rndquu No we don't really need to. The two main reasons to use the wildcard are:

unpractical configuration file (to have one plugin listening to multiple events, we need to copy paste the whole configuration in several places)
watchers that need to listen to every event (we try to avoid using crons, so for plugins like the user-activity-watcher the more updates we have the better)

If we solve these two points wildcard becomes obsolete.

A fix I though of is to have something like this in the plugin manifest:

{
  "name": "Query user",
  "description": "Queries a user and retrieves its related information, such as the wallet, the label access control, or the current XP.",
  "runs_on": [ "issues.opened", "issues.closed" ]
  "commands": {
    "query": {
      "example": "/query @ubiquibot",
      "description": "Returns the user's wallet, access, and multiplier information."
    }
  }
}

The runs_on section would determine what events the plugin is reacting to. The good thing is that we would not have to worry about putting the plugin in the right events.

This fix will also greatly reduce our KV usage which is currently high.

0x4007 commented 1 month ago

The runs_on section would determine what events the plugin is reacting to. The good thing is that we would not have to worry about putting the plugin in the right events.

It seems nice that the plugin will decide on behalf of the partner what events it should listen to. This is more hands-off for our partners.

The only concern that I have is if a partner wants to custom map commands to different keywords. But perhaps that is not important for a long time (at least until we have plugins with command keyword collisions.)

rndquu commented 1 month ago

The runs_on section would determine what events the plugin is reacting to.

So plugin keys (example) will become obsolete and partners will simply set a list of plugins they want to use like:

plugins:
  - uses:
      - plugin: org/plugin1
        with:
            param: a
      - plugin: org/plugin2
        with:
            param: b

0x4007 commented 1 month ago

I am cautious to agree with this proposal. Are there no situations with our existing plugins where we would want a certain plugin to ignore a specific event? For example, in repo a we use the default behavior of the plugin, but in repo b we need to ignore a specific event for that plugin.

It seems "safer" to allow partners to choose what plugins respond to what events. Although I can't think of any specific ignore scenarios we would need at this time. @gentlementlegen any ideas?

On the other hand, I am optimistic for any proposal that simplifies the partner experience. The new proposed config seems more partner friendly.

gentlementlegen commented 1 month ago

Maybe within the configuration we could override the behavior if needed, such as:

{
  "name": "query-user",
  "runs_on": [ "issues.opened", "issues.closed" ]
}

But we only want it to run on opened event:

plugins:
  - uses:
      - plugin: org/query-user
        runs_on: [ 'issues.opened' ]

Which would give full control over the events by overriding them. I do not have any practical use case at the moment.

0x4007 commented 1 month ago

Which would give full control over the events by overriding them. I do not have any practical use case at the moment.

In this case we can proceed, and we can add support for overrides later?

whilefoo commented 1 month ago

This approach seems interesting. First thing that comes to mind it that we have to fetch manifest for the first plugin in every chain and that might be 10+ requests however that could be solved with some kind of cache

0x4007 commented 1 month ago

Unfortunately KV reads also cost quota. Might be best to not cache.

gentlementlegen commented 1 month ago

Sadly we are tight on both quotas:

50 requests per worker run
1000 KV writes a day

I think that currently KV writes are a more precious resource to save because they are per 24 hour period. On the other hand 50 requests could be handled by having a Worker whose only job is to retrieve the configurations, which could be part of the Kernel itself by using service bindings.

I'll get a first draft with no cache and let's see if that becomes problematic.

Opened a related PR for the configuration changes: https://github.com/ubiquity/ubiquibot-kernel/issues/73

gentlementlegen commented 1 month ago

I believe this problem is half taken care of, since the manifest.json allows us to avoid using *. Only issue remaining is plugin recursive calls.

ubiquity / ubiquibot-kernel

Infinite loop using wildcard event #72