nodejs / TSC

The Node.js Technical Steering Committee

General Build WG issues #562

Closed Trott closed 5 years ago

Trott commented 6 years ago

I believe there's general agreement about certain issues surrounding Build WG.

I'm raising this to the TSC rather than Build WG because Build WG may not have the bandwidth (or resources) to deal with these issues. (These are not new issues.)

I don't have any solutions to propose, at least not yet. (The obvious one--paying people to monitor the Jenkins infrastructure--has been discussed elsewhere and TBH I don't even remember if the last time it was discussed, the conclusion was "no" or "let's ask the Board" or what. I think @mhdawson was the instigator of that last go-around on that conversation, so maybe he remembers.)

mcollina commented 6 years ago

I would also note that participating in build is a “silent” job: it does not get much recognition and it does not attract many contributions. As an example, we are not listing build wg members in nodejs/node.

I think we should promote (active) build wg members more.

bnb commented 6 years ago

@mcollina I've not participated in build at all, but that's entirely my understanding as well. Zero thanks are given for one of the most vital and important pieces of our infrastructure... especially not to those who dedicate a non-trivial amount of time.

From a CommComm perspective, definitely +1 to actively promoting the Build WG members more.

mhdawson commented 6 years ago

I did start the discussion on "paying people" last time (this is the issue https://github.com/nodejs/build/issues/1154), but I probably did not push hard enough to keep the discussion going.

I did talk to the executive director and it sounded like it was a possibility.

The stumbling block was the concern that paying somebody might deter/cause others to be less willing to volunteer on the build WG side and the slippery slope once you start paying some collaborators.

The challenge I see is that volunteers are a good fit for "when I have time, I'll look at what needs to be done and do that" as opposed to "It has to be done now, drop everything". Of course, if the "drop everything" is infrequent it still works out ok.

It would be good to have the "impact" of our current state of availability captured somewhere (I'd suggest this issue). For example, is it that it is frustrating because things don't get fixed right away, which slows things down (or slows people who currently have time to work on something at a particular time), or is it that things don't get fixed at all and we are getting into a worse and worse state? The build WG clearly understands there are lots more things that the project (and the build WG) would like to see happen, but I think identifying the top X impacts to day-to-day work that make it urgent for the TSC to step in and help push forward change would help focus the discussion.

Ideally, the best answer is to enable people to help themselves when they come across a problem as opposed to needing to call somebody else in. On that front @gdams is starting to look at setting up Ansible Tower so that we can let people run cleanup-type work more easily, but it's certainly not going to be a silver bullet and is going to take time to make progress.

We did have a number of new people volunteer and I think most of them are now onboarded (although I could be wrong on that front). This came out of the collaborator summit which does show that more promotion can help so definitely +1 on that front as well.

Trott commented 6 years ago

> It would be good to have the "impact" of our current state of availability captured somewhere (I'd suggest this issue).

I think the more telling impact isn't the impact on the project but the impact on the Build WG itself. I think people burn out and check out, even if they don't say as much. I also think recent friction between the two most active folks on the WG probably stems from some or all of these issues on some level. (I'm not sure if they'd agree with that.) It also makes recruiting and onboarding difficult, thus perpetuating the problems.

> We did have a number of new people volunteer and I think most of them are now onboarded (although I could be wrong on that front).

Two of the four of them have been onboarded. That we got four volunteers was a result of an extra push by Refael, Tierney, me, and probably others during the Collaborators Summit.

mhdawson commented 6 years ago

If the impact is on the Build WG itself, can we change the expectations on "current state of availability"? I.e., set the expectation that somebody may not be available at all times and people just have to wait. If the project is mostly moving forward as it needs to, but the Build WG feels under stress, that might help. If changing that expectation is not reasonable, then that supports the case that we need to get certain things done in a different way (and we'd need to identify those things).

The issue starts out by indicating that the problem is that most people are around only sporadically (I think the expectation in most parts of the project is that people will only be around sporadically). To me the expectation that people in the build WG will be "more available" might be part of the cause of burnout in the build WG... (I also understand why there might be this expectation, as we want builds etc. to keep moving forward.) Maybe I'm misunderstanding what you meant by sporadically. I'm interpreting it as meaning that people are only available asynchronously. I also understand that things not being in a good state is also a cause of burnout, even without external pressure, so it might just be that members get frustrated because things are not in as good a state as they would like them....

Trott commented 6 years ago

> Maybe I'm misunderstanding what you meant by sporadically. I'm interpreting it as meaning that people are only available asynchronously.

Lots of subtleties here. First, the people who need Build WG folks available are often other Build WG folks. This is especially true when dealing with the super-privileged infra that is used for releases.

Second, it's not that Old Timer Ted isn't around at a convenient time of day. It's that they might not be paying attention to build stuff for days or weeks at a time.

Third, yes, we can change the expectation that someone will get on a problem within N hours or whatever. But only if we're willing to slow the velocity of the project. I'd actually advocate for that, but I don't think there's much of an appetite for it on the project. I think people might accept that as a temporary measure but I don't know if people would be enthusiastic about it as a permanent solution.

To be honest, though, talking to @maclover7 and @refack about these issues might be more useful than talking to me. They may have different ideas about solutions and whatnot, but I suspect that they would be in agreement about the issues.

maclover7 commented 6 years ago

First, just want to say I'm really happy we're continuing to have these conversations, and to try and work through difficulties facing the Build WG. Most of this stuff is not easy to solve, and continuing the dialogue is very important, at least to me.

To put a face on the issues (my intention is not to make this all about me, or overgeneralize what's happening, but I think it would help to give at least one person's perspective), here is a quick-and-fast list of my current difficulties:

Like @mcollina and @bnb mentioned, this is largely "silent"/"hidden" work with little-to-no project recognition, which makes it tough to attract contributions. I recently onboarded two new members (Matheus Marchini and Luca Lanziani), and both seem excited about contributing.

Something that might be good to do would be to reset the relationship with users of Build WG services, and establish a more formal "contract" (read: listing out expectations for everybody involved). Maybe this should be done for the Build WG itself, as well? At that point, IMHO, we can figure out how to better use existing or new WG resources (machines, volunteers, Foundation $$) to get over that finish line.

mhdawson commented 6 years ago

@maclover7 thanks for adding your perspective.

In respect to

> Higher-up-infra is gated to a very small group of people, many of whom have not been involved in any form for months or years

I think part of the problem on that front is the visibility of what people are doing. Looking at the list of people listed as Infra Admins, I know that other than one person who's been pulled away by their current job, people have been active in the last few months (agreed, not to the level we'd all like to be). The challenge is that everybody does have a lot on their plate and prioritizes what they get done, which may not match up with what other people want/need them to do in order to push forward what they are working on. So I'm not disagreeing that having more people who can help on this front is a good idea, only saying that calling people completely disengaged is not fair either.

mhdawson commented 6 years ago

I think this is a key part of the discussion

> But only if we're willing to slow the velocity of the project.

I think it's reasonable to achieve a certain level with volunteers. If that level does not match the expectations for the project then we need to either adjust the expectations or look for other ways to meet the expectation.

I'd agree with you that we should consider the "slowing the velocity of the project". Maybe starting by defining what is reasonable given the current volunteers, and proposing we formalize that as a way to have the discussion about whether we slow the velocity or find another solution.

Trott commented 6 years ago

Relevant to this discussion: https://medium.com/@Trott/on-landing-code-when-ci-fails-f3aa999cda3d

@mhdawson That's how I think we should throttle velocity, FWIW.

mhdawson commented 6 years ago

@Trott do you think there is a better way to have the discussion about adopting that approach other than just opening a PR to update our onboarding/guidance to state that is the approach along with some of the context? I know there might be a fair amount of discussion, but opening the PR is likely the best way to get it started.

refack commented 6 years ago

From my POV, the situation has improved drastically in the last couple of months.

  1. The status of the test CI cluster seems to be converging towards a minimum of spurious fails.
  2. Number of reported incidents in GitHub & IRC has reduced.
  3. @nodejs/build-files is used more, and is replacing @nodejs/build, so pings have slowed down to a manageable number (less than 5 a week).

This might be due to one of two reasons. Hopefully it's because capacity is slowly catching up to demand. Alternatively it's because the Collaborators have given up on the infra. We're trying to better understand which one it is...

IMHO better focus and re-aligning of expectations will eliminate the pressure on the Build team.


As a reminder, the Build team is tasked with facilitating two separate tasks:

Since the second task is far less frequent, and can be coordinated, and performed by experienced users, IMHO it could receive lower priority for the time being. So as I see it stabilizing and then improving CI testing should be the main focus for a while. For that we need better feedback, tracking & reporting, and managed expectations.

Trott commented 6 years ago

> @Trott do you think there is a better way to have the discussion about adopting that approach other than just opening a PR to update our onboarding/guidance to state that is the approach along with some of the context? I know there might be a fair amount of discussion, but opening the PR is likely the best way to get it started.

@mhdawson That PR already happened, although arguably it snuck in under the radar (although I don't think opening a PR is sneaking anything--then again, it may have been insufficiently clear in the title what was going on?). It was really two PRs. First https://github.com/nodejs/node/pull/19458 and then further tightening in https://github.com/nodejs/node/pull/21645.

What might be good now, assuming there is buy-in on this practice, is maybe to announce it in the discussion board for Collaborators.

mhdawson commented 6 years ago

I'm guessing many people are not going to be aware: there are so many PRs that there is no way we can reasonably expect everybody to keep on top of all of them, particularly collaborators who have more limited time to contribute. Even though I try to read the titles of every Issue, 21645 still slipped by me and I only learned about the "Resume build" (which is great!) from another collaborator last week. Part of that might be that it was only open for 2 days so you had to catch it during that window.

Since it's a change from past expectations, I think we need to be messaging the whole collaborator base, most likely a number of times until we see behavior change. That should either help people become aware and start following the new practice or ignite discussion which we need anyway if we don't have buy in.

Might even be good to have something on the page for starting the build that says "New" please read. Kind of like signs that advertise when new stop signs are added.

mhdawson commented 5 years ago

Given that we have added a Strategic initiative in https://github.com/nodejs/TSC/blob/master/Strategic-Initiatives.md to look at Build resources can this be closed and have ongoing discussion covered in that initiative?