rock-core / autoproj-daemon

An Autoproj daemon plugin that watches GitHub repositories
BSD 3-Clause "New" or "Revised" License

Improve how new packages are handled #46

Open · g-arjones opened 3 years ago

g-arjones commented 3 years ago

Currently, adding and properly testing new packages is a painful and unintuitive process. To be able to properly test things one must:

  1. Add dummy (compilable) code to the mainline of the new package;
  2. Open a PR to a mainline package set;
  3. Merge the package set PR;
  4. Open a PR with the "first version" of the new package.

Step 1 is necessary because otherwise the build of the package set PR (from step 2) fails, and step 3 is necessary because otherwise the daemon won't generate an override entry for the new package (because it is not defined yet). This is very difficult to understand for people who are not familiar with how all the pieces work together.

Now, there is https://github.com/rock-core/autoproj-github that handles new packages by switching package set branches before generating overrides. This may involve multiple restarts, and it is hard to guarantee consistency if something breaks between restarts (we would have to implement some kind of rollback mechanism), so it is hard to implement something like that in the daemon itself.

My proposal then is to remove the buildconf management feature (no longer pushing overrides to buildconf) from autoproj-daemon and to add autoproj-github plugin steps (install plugin and generate overrides) to buildbot-ci before the "autoproj update" step. This also has the benefit of guaranteeing that manually triggering a rebuild from within buildbot's web UI will always test things with up-to-date overrides.
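To make this concrete, a minimal sketch of what such a build factory could look like; the overrides-generation subcommand here is an assumption for illustration, not the actual autoproj-github CLI:

```python
# Sketch only: the "generate-overrides" subcommand is hypothetical.
from buildbot.plugins import steps, util

factory = util.BuildFactory()
factory.addStep(steps.ShellCommand(
    name="install autoproj-github",
    command=["autoproj", "plugin", "install", "autoproj-github"],
))
factory.addStep(steps.ShellCommand(
    name="generate overrides",
    command=["autoproj", "github", "generate-overrides"],  # hypothetical
))
factory.addStep(steps.ShellCommand(
    name="autoproj update",
    command=["autoproj", "update"],
))
```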

We lose the ability to track which PRs are known to the daemon by looking into the buildconf branches, but that's not something I ever do and I wonder if others would (buildbot's web UI is much better for that with the virtual builders thing).

@doudou What do you think?

doudou commented 3 years ago

I actually had this exact same thing on my wishlist: have the daemon only trigger the build, and use autoproj-github to generate the overrides as a step in the build process. So :+1:

g-arjones commented 3 years ago

So, as I started thinking about the implementation discussed here, I realised we (maybe just I?) missed something. Integrating autoproj-github with buildbot solves only one part of the problem we have when adding new packages, namely generating overrides for packages that are not defined yet.

But there's a bigger one to solve: watching events/pull requests related to the new package (which is not defined yet). Buildbot will run tests for the new package when it tests the PR on the package set, but it won't test any new commits to the PR on the actual package. This is so bad I'm not even sure it's worth implementing.

Now, one way we could make this work would be to create the concept of a "transient package definition" in autoproj-daemon. These definitions would exist only as long as the PR on the package set that created them remains open. This would require implementing a REST API in autoproj-daemon so buildbot could create the transient packages as a build step when testing package sets.

I'm not sure about the feasibility of something like this for (at least) three reasons:

  1. We need to implement a mechanism in autoproj-ci (or autoproj-github) to detect that new packages were added;
  2. We need this information on buildbot-master (which will finally connect to autoproj-daemon's REST server);
  3. We need to handle, possibly among several other things, conflicting PRs (i.e. multiple PRs that add the same package name but with different branches, URLs, etc.)

@doudou Thoughts? How complex does that sound to you?

doudou commented 3 years ago

The workflow we have here, related to new packages, is essentially:

The complexity of handling new packages is (currently) so high that I don't think it's worth pursuing.

For what it's worth, I think the current autoproj-daemon architecture is completely wrong. If I had the time to start over, what I would now do is have a central daemon that spawns new processes to handle workspaces. After a successful bootstrap/update, it would read data exported in files (possibly through a separate new command) to get the info it needs to "watch" a given configuration.

It would allow us to have a single daemon watch many configurations and efficiently watch common repositories only once. We currently have 8 daemons running; it's not tenable. Moreover, it turns out that git sometimes freezes forever, so I have to kill it manually every once in a while to get a particular daemon unstuck. The central daemon could more easily detect that an autoproj update takes an unnatural amount of time and kill it.
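As a rough illustration of the detect-and-kill part (purely illustrative Python; timeout value and error handling are placeholders):

```python
# Sketch: run the workspace update in a subprocess and kill it if it hangs.
import subprocess

def update_workspace(workspace_dir: str, timeout_s: int = 1800) -> bool:
    try:
        subprocess.run(
            ["autoproj", "update"],
            cwd=workspace_dir,
            check=True,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False  # the central daemon can reschedule or alert
```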

This would also allow us to solve the "new package" and "new dependencies" problems reasonably efficiently: the daemons could actually do an autoproj update for every push and re-discover the dependencies of a particular configuration. Since the workspace will be essentially unchanged, it would be "fast enough" (we would not install osdeps, obviously).

g-arjones commented 3 years ago

The workflow we have here, related to new packages, is essentially

Yes, this is the only way to make it work properly right now. But sometimes people don't follow that, even more so after years of getting used to only adding a package after its first implementation is ready and opening all PRs at the same time (package set and the package's first implementation).

The complexity of handling new packages is (currently) so high that I don't think it's worth pursuing.

That's the feeling I have as well.

For what it's worth, I think the current autoproj-daemon architecture is completely wrong.

I was thinking the same thing, but I had a different idea. I wanted to get rid of the concept of watching packages as we do now and start leveraging GitHub's webhooks. They can be set for a whole organization, and the job of autoproj-daemon would be to filter GitHub's calls and trigger the appropriate builds on buildbot. We wouldn't be able to watch packages that are not within an organization we control but, TBH, that's not a big deal for me. I think it would reduce complexity a lot, but it essentially means rewriting autoproj-daemon from scratch (this time it would probably be a standalone application and not a plugin anymore).

An idea also worth considering is taking the time to really assess the work necessary to fix the "global package registry" issue in autoproj, which would make life much easier, especially for your single-daemon-multiple-workspaces use case.

Moreover, it turns out that git sometimes freezes forever

As far as I remember, other than on the occasional workspace update, git is only used by the buildconf management thing to push the overrides file (which was a bad idea; overrides should really be computed at build time).

doudou commented 3 years ago

TBH that's not a big deal for me.

It is one for me. Just think about rock-core: it's impractical to expect that any organization that wants to use CI and Rock should have webhooks on rock-core.

Moreover, I don't see how webhooks improve much. They remove the need to poll repositories, but that's about it. Actually, they don't even do that: webhooks are unauthenticated, and therefore a common recommendation is to check whether the hook is "real" by calling the API explicitly. All the rest of the machinery still needs to exist.

I agree that refactoring autoproj to be truly a "library" would be great. Would love to see an autoproj 3 do that. Now, compared to what I propose, it would probably be a few months of work and a lot of changes to autoproj vs. really not much.

g-arjones commented 3 years ago

It's impractical to think that any organization that wants to use CI and rock should have webhooks on rock-core

Then some other strategy has to be implemented since non-members of Rock's organizations can't poll events anyway (see #45).

They remove the need to poll repositories, but that's about it

They also remove the need to keep a PR cache, track open PRs, account for stale PRs, poll push events, ensure push events aren't processed more than once, etc. Which, together with the buildconf management (that should still be removed anyway), is pretty much all the daemon does. Besides, we already know GitHub's API isn't meant to be used like that and lost events are common (e.g. https://github.com/rock-core/autoproj-daemon/issues/30, https://github.com/rock-core/autoproj-daemon/issues/45 and https://github.com/rock-core/autoproj-daemon/issues/17).

webhooks are unauthenticated

They can, in fact, be signed with a secret token.
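For reference, GitHub sends an HMAC-SHA256 of the payload (computed with the configured secret) in the X-Hub-Signature-256 header, so verification is roughly (illustrative Python):

```python
# Verify a GitHub webhook signature against the shared secret.
import hashlib
import hmac

def webhook_is_genuine(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 value with a locally computed one."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```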

Now, compared to what I propose, it would be probably a few months of work and a lot of changes to autoproj vs. really not much

I can't argue. Not sure if that would be weeks or months of work but definitely not days. Now, whether it's worth it or not is up for debate. Implementing something that manages multiple autoproj workspaces reliably right now would be very frustrating in my opinion.

doudou commented 3 years ago

They remove the need to poll repositories, but that's about it

They also remove the need to keep a PR cache, track open PRs, account for stale PRs, poll push events, ensure push events aren't processed more than once, etc. Which, together with the buildconf management (that should still be removed anyway), is pretty much all the daemon does. Besides, we already know GitHub's API isn't meant to be used like that and lost events are common (e.g. #30, #45 and #17).

I'm afraid we would not be able to do away with most of this, even with webhooks. I wouldn't be surprised if webhooks can be lost and/or duplicated, and in any case you have to account for a daemon not being available for a while and still have things work (i.e. we need to do an initial scan). Moreover, we still have to track dependencies internally to know what to trigger on each push (pkg-a, PR#4 might require a build on pkg-b, PR#6 - information that is not present in the hook's payload).

I know I'm the one that pushed for using the event streams, but I have since learned my lesson. We definitely should have been hitting the primary APIs directly. We're basically doing this systematically now to get the commit messages anyway.

g-arjones commented 3 years ago

So, I didn't really have the time to get to this yet but I did some research (mostly trying to understand buildbot's internals) and I am almost sold on the idea of getting rid of the daemon for good.

My plan is to implement a buildbot-autoproj plugin (as a proper Python package) that would (in addition to what buildbot-ci's rock.py already provides):

  1. Load installation manifests from a set of autoproj workspaces (kinda like what we've done in vscode-rock)
  2. Provide a ChangeFilter that would filter changes based on package lists loaded from the installation manifests
  3. Provide a BuildbotService that would update autoproj workspaces on changes
  4. Provide extra utilities, e.g. for instantiating a set of builders and schedulers based on the package list

Builds could then be configured to be triggered either by webhooks or by polling-based change sources. Buildbot already provides a GitHubPullRequestPoller, but for the polling-only use case we could also implement a GitHubBranchPoller, since GitPoller is not scalable IMHO.

As far as I can see, buildbot already provides most of the infrastructure we would need for all of this.
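As a rough sketch of how those pieces might fit together in master.cfg (the watched_repos set stands in for data the hypothetical manifest loader would provide, and repository/builder names are placeholders):

```python
from buildbot.plugins import changes, schedulers, util

# Placeholder for package repositories loaded from the installation manifests
watched_repos = {"rock-core/base-types", "rock-core/base-cmake"}

change_source = [
    changes.GitHubPullRequestPoller(
        owner="rock-core",
        repo="base-types",
        token="github-api-token",
        pollInterval=60,
    ),
]

pr_scheduler = schedulers.AnyBranchScheduler(
    name="autoproj-prs",
    # only react to changes that belong to a package defined in a workspace
    change_filter=util.ChangeFilter(
        filter_fn=lambda change: change.project in watched_repos
    ),
    builderNames=["autoproj-build"],
)
```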

What do you think? Would you be interested in helping with such a project?

Moreover, we still have to track dependencies internally to know what to trigger on each push (pkg-a, PR#4 might require a build on pkg-b, PR#6 - information that is not present in the hook's payload)

If I understood what you mean, we don't need this information to trigger builds. We need this only to compute overrides (which would be done as a build step)

doudou commented 3 years ago

What do you think? Would you be interested in helping with such a project?

I'm not sold, no. What I wanted to have is something that could be ported relatively easily to something other than buildbot. I've been bitten too many times by committing to a CI system only to find out that I have to migrate to another one.

The daemon / cache / postprocess scheme we have right now fits the bill. Any CI system that can be triggered externally can be used. I'm still using buildbot because of the worker infrastructure (kube + ARM builds), which would take quite a bit of time to set up elsewhere.

To be honest, it smells like the proverbial "let's rewrite all of it, it's going to be so much better". Except I personally believe it won't. What's the advantage of rewriting the daemon as it is in buildbot? I don't see anything that will get better.

Finally, if I were you, I wouldn't spend time doing any of this until I had checked that it's practical to have hundreds of git pollers in buildbot (whether theirs or your own).

doudou commented 3 years ago

If I understood what you mean, we don't need this information to trigger builds

Well. Yes, we do. If PR#A depends on PR#B, you want to trigger the build for PR#A when PR#B is changed.

g-arjones commented 3 years ago

To be honest, it smells like the proverbial "let's rewrite all of it, it's going to be so much better"

This is really not my goal. The problem right now is that maintaining the daemon and fixing all the issues (the ones open here and others) seems unfeasible for both of us. Most importantly:

  1. Keeping several instances of the daemon really doesn't work (I don't know how you've been dealing with this for so long). Every single instance polls the same repositories, deploying is a pain, etc.
  2. It only works with GitHub.
  3. It doesn't support webhooks.

These are the most pressing for me at the moment.

Plus, as far as I can tell, what I am suggesting is not a rewrite of the daemon in buildbot. Most of the things are already there, except for the installation manifest parser (and the branch poller to fit your use case). The workspace updater service is really just receiving events from change sources and shelling out to autoproj (could even be a custom scheduler). Sounds like it would be a more manageable codebase.

Well. Yes, we do. If PR#A depends on PR#B, you want to trigger the build for PR#A when PR#B is changed.

That would make sense (we don't do that in autoproj-daemon) and could be done as a build step as well

doudou commented 3 years ago

I believe the path from what we have to what we want to have is a lot shorter working with the daemon, plus the added feature of not becoming super dependent on buildbot. You point out yourself that it's quite a bit of work and I agree. I think that what you underestimate is the amount of work needed to get it done from scratch in buildbot itself. 90% of the daemon complexity was not and is not the access to the GH API. It is, and will remain, the logic.

I think you have to look at your priorities really hard and decide for yourself. Polling vs. webhooks is an itch (it's inelegant). It can't possibly be a pressing issue (or you have too much time on your hands). If you need to work outside GitHub, that's definitely a problem. But then, removing the need to create the overrides branch will simplify the daemon quite a bit.

Now, if you think you can scratch your itch by doing it inside buildbot, suit yourself.

For what it's worth, webhooks are not my cup of tea at all because only org owners can install them - and I definitely don't want every single org that uses Rock to have to install webhooks on the Rock orgs.

That would make sense (we don't do that in autoproj-daemon) and could be done as a build step as well

??? I do believe that we do. My gripe currently is that what I actually wanted is for a push to PR#B to only trigger PR#A. I'm getting supremely tired of having a PR build fail because changes are also needed downstream.

g-arjones commented 3 years ago

I think that what you underestimate is the amount of work needed to get it done from scratch in buildbot itself

I think that's pretty much the point. Either I underestimate it or you overestimate it.

It can't possibly be a pressing issue

It is once you have to set your polling period to 30+ minutes because you have multiple daemon instances polling rock-core, rock-drivers, rock-control, and rock-gazebo multiple times (not to mention that this polling is useless and just a waste of time/API calls because of #45). Having to wait 45+ minutes to get test results back is not cool.

removing the need to create the overrides branch will simplify the daemon quite a bit.

Yeah, that will be my top priority if I decide to keep using the daemon.

I do believe that we do

I don't remember implementing anything like that myself. Unless you did, I really believe the daemon doesn't do that. We do rebuild if the resulting overrides change (because, e.g., a dependency was merged, removed, or added), but not because a PR dependency's head changed. We don't even keep dependencies between PRs stored anywhere (which, in hindsight, we should).

So, in summary, to get the daemon to where we both want it, we would have to:

  1. Remove event polling;
  2. Remove "buildconf management";
  3. Figure out a way to have a single daemon managing multiple workspaces (thus fixing the polling of the same repository multiple times problem);
  4. Add GitLab support

That's quite a bit of work...

doudou commented 3 years ago

I don't know why you have 30+ minute polling periods. I had 17 daemons at the end of last year, all of them pretty much polling the same repositories, and they were all polling at the default 1-minute interval. Polling is not rate-limited and essentially free as long as nothing changes (since we're using the cache headers).
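The trick is GitHub's conditional requests: replaying the ETag from the previous response makes an unchanged poll come back as 304, which does not count against the rate limit. Roughly (illustrative Python, not the daemon's actual HTTP layer):

```python
# Poll a GitHub API URL, reusing the last ETag so unchanged responses are free.
import requests

etag_cache = {}

def poll(url: str, token: str):
    headers = {"Authorization": f"token {token}"}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    response = requests.get(url, headers=headers)
    if response.status_code == 304:
        return None  # nothing changed, the request was not rate-limited
    response.raise_for_status()
    etag_cache[url] = response.headers.get("ETag", "")
    return response.json()
```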

(1) is "replace event polling by straight up pull request polling". We can't avoid polling because of the limitations of webhooks I talked about earlier in this thread. (2) I don't know what that is ... the branch generation ? (3) Agreed. I already sketched how it would work. (4) That's on you ... I really don't have that one.

Last, but not least, given our limited resources, I believe an incremental approach will be much better than the "buildbot rewrite". Most of that stuff can be done and released step by step:

  1. remove branch generation (replaced by autoproj github in the build itself)
  2. replace event polling by pull request polling. This is, by the way, already pretty close to what the daemon does. Since the last refactoring, we use events as indicators of which pull request should be queried, but we then read the pull request and act on that data alone. I guess (it's been a while) that transforming to remove the event stream completely should be reasonably simple.
  3. abstract away the github source to write a gitlab adaptor
  4. separate the daemon from the autoproj workspace, by calling subprocesses to bootstrap/update/... and then reading the installation manifest (in preparation for "single daemons for multiple configs"). Add a bloody timeout and kill the subprocesses when we reach it (that's mine ... git sometimes freezes during updates and I have to kill it manually)
  5. meta-daemon

My final word is this: if I did have the time to work on that stuff right now, I would really tackle the biggest problem I have, which is the very sorry state of reporting. The first one is obviously having builds fail even if the failure is in a cached package. The second one is to have something to browse build results. This is, for me, the biggest pain point. The rest is annoying, but for us at tidewise, it is not pressing.

g-arjones commented 3 years ago

(1) is "replace event polling by straight up pull request polling". We can't avoid polling because of the limitations of webhooks I talked about earlier in this thread.

That is, replace event-stream polling with branch-head + pull-request polling, not with webhooks. We really don't have a choice here, since polling events is broken when the API key owner is not in the organization.

(2) I don't know what that is ... the branch generation ?

That would be what BuildconfManager does. Yes, branch generation and management (deleting the branch when the PR is stale, creating it back when activity resumes, and deleting it when the PR is either closed or merged).

Since the last refactoring, we use events as indicators of which pull request should be queried, but we then read the pull request and act on that data alone. I guess (it's been a while) that transforming to remove the event stream completely should be reasonably simple.

It's been a while indeed, but I have the feeling that what we want is actually closer to calling this periodically (after removing the buildconf branch stuff and adding a step to check whether local and remote heads are different using the API). It's essentially doing periodically what the daemon currently does on every start (triggering builds for new and updated pull requests), plus triggering builds for packages that changed.
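The head check itself is cheap, something along these lines (illustrative Python; in the daemon it would go through the existing client and caching layer):

```python
# Compare a branch head reported by the GitHub API with the locally known sha.
import requests

def remote_head_changed(owner: str, repo: str, branch: str,
                        local_sha: str, token: str) -> bool:
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}",
        headers={"Authorization": f"token {token}"},
    )
    response.raise_for_status()
    return response.json()["commit"]["sha"] != local_sha
```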

meta-daemon

What's that?

The second one is to have something to browse build results.

This one is not that bad for me. I guess that would be because I didn't pull your last changes to buildbot-ci (which I think increases the number of builds shown in the UI).

doudou commented 3 years ago

It's been a while indeed but I have the feeling that what we want is actually closer to calling this periodically

I think you're on to something. Why not simply re-build the complete state each single time? Only keep, over time, a fingerprint of what triggered the last build for a particular pull request, rebuild the state from scratch, re-compute the fingerprints and trigger if things "have changed". As long as we make sure to keep the same HTTP middleware (to use caching), it should be pretty reasonable and so much simpler.
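Conceptually, something like this (illustrative Python; the field names are made up, not the daemon's actual data model):

```python
# Rebuild the whole PR state each cycle and trigger only what changed,
# based on a fingerprint of whatever should cause a rebuild.
import hashlib
import json

def pull_request_fingerprint(pr: dict) -> str:
    relevant = {
        "head_sha": pr["head_sha"],
        "base_sha": pr["base_sha"],
        "dependencies": sorted(pr.get("dependencies", [])),
    }
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def prs_to_trigger(current_prs: list, last_fingerprints: dict) -> list:
    return [
        pr for pr in current_prs
        if last_fingerprints.get(pr["id"]) != pull_request_fingerprint(pr)
    ]
```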

We'll be wasting some CPU time for sure, but the graphs won't be so big that it will matter.

That's pretty powerful stuff.

meta-daemon

What's that?

A bad name. Was supposed to be the daemon that handles more than one configuration.

This one is not that bad for me. I guess that would be because I didn't pull your last changes to buildbot-ci (which I think increases the number of builds shown in the UI).

Oh. I'm not even complaining about speed. I'm complaining about having to re-trigger builds because the one I want to see is already out of the shown builds ...

g-arjones commented 3 years ago

I think you're on to something.

See this

g-arjones commented 3 years ago

I can't spend more time on this, hopefully you can finish it. It's missing:

  1. Remove buildconf branch management
  2. Single daemon, multiple workspaces.

With the extra layer of abstraction over the git APIs, I don't think it makes sense to have all this code duplicated in a separate plugin anymore (to generate the overrides). Instead, I suggest we keep the code that computes the overrides here and send the result as a JSON object to buildbot in the change hook payload. There's a "properties" field that seems to exist exactly for this kind of use case.
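For illustration, pushing the computed overrides through buildbot's base change hook could look roughly like this (URL, credentials and field values are placeholders, and how the properties field is parsed should be double-checked against the buildbot docs):

```python
# Post a change to buildbot's /change_hook/base with the overrides as a property.
import json
import requests

overrides = {"drivers/some_pkg": {"remote_branch": "feature_branch"}}

requests.post(
    "http://buildbot.example.org:8010/change_hook/base",
    data={
        "project": "rock-core/buildconf",
        "repository": "https://github.com/rock-core/some_pkg",
        "branch": "feature_branch",
        "revision": "0123456789abcdef",
        "comments": "PR updated",
        "properties": json.dumps({"overrides": overrides}),
    },
    auth=("change-hook-user", "change-hook-password"),
)
```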

doudou commented 3 years ago

There's a big advantage to having the overrides computed in a separate plugin ... it's possible to trigger a build manually and have it pick up the "right" overrides. I've been missing that a few times, whenever the daemon wasn't working properly.

doudou commented 3 years ago

Just for my understanding: have you managed to test the PR? I'm kind of assuming the whole GitLab thing needs real-life testing somehow ...

g-arjones commented 3 years ago

it's possible to trigger a build manually and have it pick up the "right" overrides

Indeed. Well, take a look at the code once you get the time and maybe you can come up with an idea on how to split this. I mean, the GitLab stuff definitely does not belong in autoproj-github...

have you managed to test the PR ?

Yes

doudou commented 3 years ago

If your problem is with "autoproj-github", what about having it as a subcommand of autoproj-ci?

g-arjones commented 3 years ago

I don't mind having overrides computed in autoproj-ci. Makes sense, actually.

But that's not the problem. The problem is that there is code that will be needed in multiple places:

  1. The whole git abstraction is necessary in the daemon and wherever the overrides are generated.
  2. The code that parses the pull request body (recursively) is necessary in the daemon even if it's not generating overrides (to track dependencies between PRs)

I just want to avoid too much duplication

doudou commented 3 years ago

Well ... to avoid duplication, we can have one package dependent on the other and reuse the code.

g-arjones commented 3 years ago

Exactly. I think (at least) two new packages will be necessary.

doudou commented 3 years ago

I personally would not mind autoproj-daemon depending on autoproj-ci.

Other than that, there is already an autoproj-git with a limited set of features. We could move the functionality there.

g-arjones commented 3 years ago

Last, but not least, given our limited resources, I believe an incremental approach will be much better than the "buildbot rewrite". Most of that stuff can be done and released step by step:

  1. remove branch generation (replaced by autoproj github in the build itself)
  2. replace event polling by pull request polling. This is, by the way, already pretty close to what the daemon does. Since the last refactoring, we use events as indicators of which pull request should be queried, but we then read the pull request and act on that data alone. I guess (it's been a while) that transforming to remove the event stream completely should be reasonably simple.
  3. abstract away the github source to write a gitlab adaptor
  4. separate the daemon from the autoproj workspace, by calling subprocesses to bootstrap/update/... and then reading the installation manifest (in preparation for "single daemons for multiple configs"). Add a bloody timeout and kill the subprocesses when we reach it (that's mine ... git sometimes freezes during updates and I have to kill it manually)
  5. meta-daemon

My final word is this: if I did have the time to work on that stuff right now, I would really tackle the biggest problem I have, which is the very sorry state of reporting. The first one is obviously having builds fail even if the failure is in a cached package. The second one is to have something to browse build results. This is, for me, the biggest pain point. The rest is annoying, but for us at tidewise, it is not pressing.

@doudou Is this stuff still in your backlog (minus 2. and 3. that I have implemented already)? Also, just out of curiosity, are you running an up-to-date autoproj-daemon (from master)?

doudou commented 3 years ago

Well. Theoretically it is on it, but to be honest the daemon works just fine for me right now - a.k.a. it's at the very bottom of the list. Were I to allocate time to CI, it would go to report management, which is really a pain.

Also, just out of curiosity, are you running an up-to-date autoproj-daemon (from master)?

I thought I was, but we are still running the pre-merge version of #52