jakirkham opened 6 years ago
I think that part of this is wrapped up in https://github.com/regro/cf-scripts/issues/53, since it is difficult to know exactly what kind of stress to expect on the system without some ballpark numbers. The whole bootstrap of the system also exaggerates some things (since we max out the CI time on every run of `03`).
Currently we are clearing about 20 feedstocks per run with `03` and all the feedstocks with the others (although we'll fall behind when we hit 5000 feedstocks on `01`).
I'm not opposed to moving this to a webservice, but the notification wrangling could be hard.
We also could do this in steps/build a hybrid system:
`00` can be removed with a hook on staged (or via the feed, https://github.com/regro/cf-scripts/issues/38).
`01` can be removed with a hook on all the feedstocks when PRs are merged (or via the feed, https://github.com/regro/cf-scripts/issues/38).
`02` might be the most difficult to remove, since listening for new releases from PyPI, CRAN, and GitHub may be difficult.
`03` I'm less certain about, since I don't know what triggers it (I guess whatever triggers `02`). If we can find a trigger for `02`, then `02` and `03` could be merged.
Honestly our experience doing this at conda-forge has taught us that the system ends up being less stressed when converted from batch to a webservice. If you think about it a bit, this actually makes sense. The reason being updates in a web service don't all come at once in a big batch (this could be re-renderings, updating Circle SSH keys, or package updates). Instead they are sprinkled throughout the days at various times. The result ends up being things stay pretty light and the system handles events right away, which makes the whole thing more maintainable.
Agreed that handling the detection of updates has been and remains challenging. PyPI lacks the right kind of notification (https://github.com/pypa/warehouse/issues/1683). Same story with R. Both provide index-wide feeds (Python, R), which we could parse. Not sure what we do with everything else. Maybe piggyback on Arch Linux? For the cases where we have feeds, we could have a process that filters these for us and triggers the update PRs. Presumably this would live on Heroku, though it could live elsewhere.
Just to outline this a bit, it sounds like we would want the webservice to handle these events. Am I missing any?
Given how package indexes seem to handle these problems, our web service would need to be designed around processing these feeds. Namely it would check feed notifications against a listing of packages. Periodically a new package could be added, in which case we would need to check its version independently and then add it to the list. Removal would be relatively straightforward. In some ways, it might not be worth processing feedstock updates (possibly removals), as this could easily be checked when the feedstock's package comes up again.
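The feed-checking loop described above can be sketched roughly like this. The `TrackedPackages` class and the shape of the feed entries are hypothetical, invented for illustration rather than taken from the bot's actual data structures:

```python
# Sketch of the feed-filtering idea: check incoming feed entries against a
# listing of tracked packages and surface only real version bumps.
# TrackedPackages and the (name, version) entry shape are hypothetical.

class TrackedPackages:
    def __init__(self):
        self.versions = {}  # package name -> last version we issued a PR for

    def add(self, name, version):
        # Periodically a new package is added; we check its version
        # independently once, then track it via the feed.
        self.versions[name] = version

    def remove(self, name):
        # Removal is relatively straightforward.
        self.versions.pop(name, None)

    def filter_feed(self, entries):
        """Yield (name, version) pairs that should trigger an update PR."""
        for name, version in entries:
            known = self.versions.get(name)
            if known is not None and version != known:
                self.versions[name] = version
                yield name, version

tracked = TrackedPackages()
tracked.add("numpy", "1.15.0")
tracked.add("pandas", "0.23.0")

feed = [("numpy", "1.15.1"), ("scipy", "1.1.0"), ("pandas", "0.23.0")]
updates = list(tracked.filter_feed(feed))
print(updates)  # only the numpy bump survives; scipy is not tracked yet
```

Untracked packages fall through silently here, matching the idea that new packages get picked up on a periodic pass rather than from the feed.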
Thoughts?
@isuruf, might be interested in this. 😉
I have an idea about using Libraries.io, Github and IFTTT to make this a webservice. Will look into it once I have some time.
I think this is now available for action. The graph is stored in a JSON format and so can be written to by pretty much anything. We could provide a webservice with the bot's credentials (or provision a new bot) so it could update the versions in the graph. Each package (that is not a stub or archived) should have a `new_version` key that represents what the bot thinks the newest upstream version is.
If external things write to the graph we could then kick off github actions that then cause PRs to be issued.
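As a rough illustration of an external writer updating the graph, assuming a simple node layout; the schema here is a guess, not the bot's actual graph format:

```python
import json

# Sketch of an external process updating the `new_version` key in the
# JSON-serialized graph. The per-node layout ("archived" flag, flat dict of
# packages) is assumed for illustration.

graph = {
    "numpy": {"archived": False, "new_version": "1.15.0"},
    "old-pkg": {"archived": True, "new_version": "0.1"},
}

def set_new_version(graph, name, version):
    """Record the newest upstream version for a package, if eligible."""
    node = graph.get(name)
    # Stubs and archived packages should not get version updates.
    if node is None or node.get("archived"):
        return False
    node["new_version"] = version
    return True

set_new_version(graph, "numpy", "1.16.0")
set_new_version(graph, "old-pkg", "0.2")  # ignored: archived

# Because the graph is plain JSON, any writer can round-trip it.
blob = json.dumps(graph)
```

Anything that can produce this JSON could then kick off the GitHub Actions that issue the PRs.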
@CJ-Wright @beckermr Is there a way to clarify this somehow? I read some of the issues regarding the web services, but some of them jump straight into a migration or a closed PR.
Sorry, what do you want clarified?
What are the webservices? I just want to understand and to help.
Sorry, no problem, I didn't know what you were asking. Conda-forge has a bunch of web services; these are tasks/jobs/things that are triggered by some action on the web. For instance, if there were something that published that a new version was available, we wouldn't need to scrape the web for it. Similarly, rather than updating all the feedstocks in the graph every run, we could just update the ones that have changed.
Uhm... I kind of get it. But just to be sure, when you say webservices, are you referring to services like Azure, CircleCI, and others? Or is it something else, like a server request? (Also, thanks for the reply.)
The code for the existing webservices (which run things like the team and token updating) is located here if you want to take a look.
My understanding (@beckermr might be able to provide more insight here, since he has contributed considerably to our webservices) is that we set up a server (usually a Heroku instance) that listens for updates from webpages and then acts accordingly.
Humm, now I understand what was being said. My understanding of the web services was quite different. Thanks.
"rather than updating all the feedstocks in the graph every run we could just update the ones that have changed."It's an exceptional idea isn't it ? Is there a way for me to help ? I saw the items list above, but its still vague.
So the essential idea of this issue is to refactor the bot into a distributed system that responds to events.
Imagine we are running a migration and package A depends on package B. When the PR for package B is merged/closed, we could detect this event by listening to a webhook. When we see that, we could look at the graph and queue up the PR for package A. We could then have a cron-ish job read from the queue and try to issue the migration.
This would be a big refactor of how the bot works and is pretty out of scope right now.
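A toy sketch of that event flow, with invented names and payload shapes (the real webhook payload and graph structure would look different):

```python
import queue

# Toy sketch of the event-driven flow: a webhook for a merged/closed PR on
# package B leads us to consult the graph and queue migration work for its
# dependents, which a cron-ish job later drains.

# Inverted dependency graph: dependents[Y] lists packages waiting on Y.
dependents = {"B": ["A"], "C": []}

work_queue = queue.Queue()

def handle_webhook(payload):
    # Called by the server when the PR-closed event arrives.
    # The payload shape here is invented, not GitHub's actual schema.
    if payload.get("action") == "closed":
        pkg = payload["package"]
        for dep in dependents.get(pkg, []):
            work_queue.put(dep)  # queue up the migration PR for the dependent

def cron_worker():
    """Drain the queue and (in reality) try to issue each migration PR."""
    issued = []
    while not work_queue.empty():
        issued.append(work_queue.get())
    return issued

handle_webhook({"action": "closed", "package": "B"})
print(cron_worker())  # -> ['A']
```

The point of the queue is decoupling: the webhook handler stays fast, and the slow PR-issuing work happens out of band.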
Oh, ok... thanks Cj and Matt thanks for the comments, now I have an idea of it.
To be fair, we could have some things done by webservice; for instance, marking a PR as merged/closed might be possible now. I think the main issue there is that the GH repo for the graph is rather large and might not fit inside the server. (This was part of the initial reasoning to move to something like Dynamo, which we should really put inside a milestone: all the things that need a distributed-database-like thing.)
What was the reason the DynamoDB idea was dropped?
We weren't able to implement it in a way that was cost-effective, and other issues were more pressing.
Uhm, and isn't there any other platform we could try? I think it could be a great improvement to reduce the burden on the CI clients.
If you can find another provider then go for it.
Don’t spend money without asking
Ok, I will definitely not do that, but it's good advice, thanks.
@beckermr What about MongoDB?
Mongo could work, although you need to host it somewhere.
Yup, actually I was wondering about the cloud mode of it, but I was not sure about the amount of data we will need (as the cloud tier is limited to 5 GB).
I think the first move there is figuring out how little of the PR JSON we can get away with.
Or maybe there are some classes of PRs we could get rid of. I was wondering about doing the "track opened and closed PRs" work first to reduce the number of PRs hosted, and then migrating the result to a table in some NoSQL server. (We could also put this list behind a web service, which would let us avoid hitting any API limits.) I could also be missing something.
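One hypothetical way to slim the stored PR JSON down to a handful of fields; the field choice here is illustrative, not a claim about which fields the bot actually needs:

```python
# Sketch of "how little of the PR JSON we can get away with": keep only the
# fields consulted later and drop everything else before storing.
# The PR_KEEP whitelist is a hypothetical choice, not the bot's real one.

PR_KEEP = ("id", "number", "state", "merged_at", "head")

def slim_pr_json(pr):
    """Return a copy of a PR payload containing only whitelisted fields."""
    return {k: pr[k] for k in PR_KEEP if k in pr}

full = {
    "id": 1,
    "number": 42,
    "state": "closed",
    "merged_at": "2020-01-01T00:00:00Z",
    "head": {"ref": "bump-1.0"},
    "body": "a very long description ...",  # bulky field we never consult
    "labels": [],
}

slim = slim_pr_json(full)
```

Measuring the size of `slim` versus `full` across the stored PRs would give a quick estimate of how far a 5 GB budget stretches.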
It’s certainly reasonable to start out with a cron job for these sorts of things. Also, while we resolve some technical debt, the cron job is very helpful. That said, we have generally found in conda-forge that cron jobs inevitably struggle to scale.
To solve this problem, we have ultimately moved all of them to web services that use webhooks. This allows them to deal with notifications as they come in and respond by doing some task. This approach seems well suited for updates. However, it will require some thought into how we can get notifications from package indexes, GitHub, etc. We expect this will iron out any issues related to load.