Kernel queue - Githubissues

rndquu commented 1 month ago

There might be cases when the kernel receives too many webhook events, for example:

some file is deleted from all of the repos
too many partners are using the ubiquity-os app

When the kernel receives too many webhooks it exceeds github API rate limits which causes the kernel to be unresponsive and throwing the Rate limit hit with \"GET /repos/{owner}/{repo}/contents/{path}\", retrying in 1469 seconds error.

There should not be any downtimes in the kernel work.

As a part of this issue we should implement a webhook queue:

on new webhook event add it to the queue
if rate limit is not exceeded proceed the next item in the queue (with cached config received at the time of webhook creation)

gentlementlegen commented 1 month ago

We should deal with priorities because I think the most important part is to keep commands running otherwise users wouldn't understand why their commands don't do anything.

Second this could be mitigated by having plugins crawling (like daemon-merging and daemon-disqualifier) running only once a day, because I think they reacted to all these events and run quite a number of times, and these crawl the whole orgs so require quite a lot of API calls.

rndquu commented 1 month ago

We should deal with priorities because I think the most important part is to keep commands running

When github app installation token is rate limited then the kernel stops executing any command in the chain so prioritizing plugins in the chain doesn't prevent rate limiting.

Second this could be mitigated by having plugins crawling

Yes, right now daemon-disqualifier runs pretty often.

To sum up there're actually 3 issues:

Our core plugins may be not optimized in terms of using github API. We could at least reduce the number of event handlers (i.e. number of ubiquity:listeners events).
Third party plugin developers may actually "grief" ubiquity-os users by consuming all available rate limits (because 3rd party plugin devs have access to the ubiquity-os installation token). Not sure what's the solution here.
Right now the kernel is not scalable in terms that on too many webhook events it becomes unresponsive for 25 minutes.

rndquu commented 1 month ago

I've temporarily removed time estimate because the spec is not clear.

As far as I understand from the docs github app installation token cannot increase beyond 12,500 requests per hour per repository (or organization?).

@gentlementlegen What if we restrict plugins' github API usage this way:

Add a new config parameter to the kernel like PLUGINS_GITHUB_API_AVAILABLE_QUOTA=10 which would mean that a single plugin in the chain can not consume more than 10% of available github API resources per hour
Before plugin execution in the chain the kernel checks available limits
After plugin execution the kernel checks available limits again. If quota was reduced more than by 10% then: a) Email notification is sent to plugin developer (fetched from manifest.json) b) Plugin execution is suspended for some time (like 30 mins)

This way plugin developers will have to optimize github usage of their plugins. This solves the issue with github API "griefing".

Keyrxng commented 1 month ago

My storage layer app needs installed by each org and gives us a new access token to work with that is bound to each individual org and each org can be tracked via installs easily. If we create a storage repo instead of using the config repo as the storage location

we don't expose the sensitive config repo to the storage layer app which everyone is going to be able to access
We can use the token which is bound to the org instead of us sharing the kernel'

rndquu commented 1 month ago

My storage layer app needs installed by each org and gives us a new access token to work with that is bound to each individual org and each org can be tracked via installs easily. If we create a storage repo instead of using the config repo as the storage location

we don't expose the sensitive config repo to the storage layer app which everyone is going to be able to access

We can use the token which is bound to the org instead of us sharing the kernel'

I don't have much context regarding the storage feature so it's not really clear:

What's the difference between the "storage layer app" and ubiquity-os app and why partners need to install both
Why we need a separate repository for storage and can't simply use the .ubiquity-os folder
Why we can't expose the config, everything is going to be encrypted there
What's the difference between the "storage layer app" installation token and the ubiquity-os one, as far as I understand there're the same entities (i.e. conform to the same 12.5k github API rate limit)

Keyrxng commented 1 month ago

https://github.com/ubq-testing/telegram--bot/blob/gh-storage/src/adapters/github/storage-layer.ts

storage app will be responsible for handling all of that partner' storage need;, giving it it's own rate limit as once implemented across all plugins the rate consumption is going to be big due to the nature of using GH as a DB. TG Bot used kernel' key to dispatch it's own wfs but deemed unsecure. We needed a token for each partner and one that has auth to read/write from inside plugins not the kernel. Installation octokit solves this and makes it easier identifying storage locations
2, 3. If we use .ubiquity-os we expose the config. If that's not a problem then we don't need another repo and can use it.
They'd share the same rate limits as you state but they'd be bound individually to the partner as opposed to all eventual partner' using a centralized token.

After plugin execution the kernel checks available limits again. If quota was reduced more than by 10% then: a) Email notification is sent to plugin developer (fetched from manifest.json) b) Plugin execution is suspended for some time (like 30 mins)

Maybe I'm mistaken but isn't it possible that with an abundance of partner' all using the same token, plugin devs could optimize to respect this threshold but have the rate limit consumed to nil because of the size of the partner network? There are 3 or 4 rn all ours but not super crazy active like onboarding a couple of huge blue chip partners might be.

It's just a suggestion, obv there are overlaps that would need ironed out but my core suggestion is for plugins to use a token bound to the partner using a GitHub App' installs which would require exposing app_private_key, in this way I believe that we obtain a partner-bound token

rndquu commented 1 month ago

Maybe I'm mistaken but isn't it possible that with an abundance of partner' all using the same token, plugin devs could optimize to respect this threshold but have the rate limit consumed to nil because of the size of the partner network?

Yes, this approach prevents partners from abusing github API rate limits but not solves the scalability issue when there're simply too many partners and thus too many incoming webhook events. We need some kind of a github API cache or queue system, whatever is simpler/faster to implement which is enough for "scalability v1".

Keyrxng commented 1 month ago

I updated my comment just as you posted sorry

It's just a suggestion, obv there are overlaps that would need ironed out but my core suggestion is for plugins to use a token bound to the partner using a GitHub App' installs which would require exposing app_private_key, in this way I believe that we obtain a partner-bound token

Plugins would use the partner-bound token and the kernel could also use it per-org I believe (@whilefoo @gentlementlegen rfc?) which would help but your right it's still very likely to max with things as-is

gentlementlegen commented 1 month ago

@gentlementlegen What if we restrict plugins' github API usage this way:

Add a new config parameter to the kernel like PLUGINS_GITHUB_API_AVAILABLE_QUOTA=10 which would mean that a single plugin in the chain can not consume more than 10% of available github API resources per hour Before plugin execution in the chain the kernel checks available limits After plugin execution the kernel checks available limits again. If quota was reduced more than by 10% then: a) Email notification is sent to plugin developer (fetched from manifest.json) b) Plugin execution is suspended for some time (like 30 mins) This way plugin developers will have to optimize github usage of their plugins. This solves the issue with github API "griefing".

It would be nice if somehow these quotas would count against their app / repo instead, no idea if this can be achieved. Otherwise yes that could work, don't grasp the complexity of such change.

rndquu commented 1 month ago

It would be nice if somehow these quotas would count against their app / repo instead, no idea if this can be achieved.

I feel like I need a primer on the kernel. I actually thought that the kernel uses ubiquity-os app's installation token and passes it to all of the plugins. So I thought that on the kernel side we could simply call this API method to check available limits.

whilefoo commented 1 month ago

2. Third party plugin developers may actually "grief" ubiquity-os users by consuming all available rate limits (because 3rd party plugin devs have access to the ubiquity-os installation token). Not sure what's the solution here.

I don't think there's incentive to grief because the rate limits are bound to a specific org and don't interrupt other orgs. The partner can just disable the plugin if it's using all of their rate limits

I feel like I need a primer on the kernel. I actually thought that the kernel uses ubiquity-os app's installation token and passes it to all of the plugins.

That's correct, the kernel uses org-bound installation tokens and passes it to plugins

It's just a suggestion, obv there are overlaps that would need ironed out but my core suggestion is for plugins to use a token bound to the partner using a GitHub App' installs which would require exposing app_private_key, in this way I believe that we obtain a partner-bound token

The tokens passed to plugins are already bound to the partner/org so I don't quite understand how the storage solution would help to solve this

if rate limit is not exceeded proceed the next item in the queue (with cached config received at the time of webhook creation)

if we rate limit is hit then we can't even download the config though unless there is a separate rate limit for that

Keyrxng commented 1 month ago

The tokens passed to plugins are already bound to the partner/org so I don't quite understand how the storage solution would help to solve this

Oh I didn't realise that was the case already my mistake

Keyrxng commented 1 month ago

See how each plugin has conditions that will cause it to "skip" or stop execution, some plugins have these checks right at the entry point and the check itself is very simple.

If we can expose these conditions from the manifest of plugins and have the kernel perform it first then we'd save a lot of false starts for actions at least.

E.g: command-ask currently spawns a workflow for every single comment that appears across all partner installs whether it begins with "@ubiquityos" or not.

Surely possible to reduce the boolean logic down to something that the kernel can consume. In command-ask the condition most simply is does payload body include an env var, all of which the kernel passes down. A fix for my example is to include a commands entry in the manifest, which does for slash commands exactly what I'm suggesting but I'm proposing we expand it so we can tell the kernel to compare arb env:payload vars and deny firing requests based on that.

This is less of a global solve and more of a potential optimization.

ubiquity-os / ubiquity-os-kernel

Kernel queue #164