Unresponsive maintainers and unclear code review SLO

FighterNan commented 1 year ago

First of all, this issue that I created is not to blame any specific maintainers. The examples changes and Discord links that I used are just for reference to start the discussion.

The high level issue that I found is some maintainers don't respond to code review or/and design doc quickly enough due to some reasons: busy with own work, on vacation, etc. I am not clear what the code review SLO is and, whether the maintainers will get warnings or not if they don't serve their responsibilities

If existing maintainers don't respond, it's very difficult for new features to get in. It then also becomes a challenge for people who want to become a new maintainer to convince the forum that they are qualified without enough technical contribution.

The examples that I have are in the phosphor-user-manager repo,

there are a quite a few changes that don't get response from existing maintainers for a while; https://gerrit.openbmc.org/q/project:openbmc/phosphor-user-manager+branch:master+status:open+is:mergeable; some of them don't have any contention and can be merged as is
there is discussion topic that the maintainers don't respond; for example, this bug (https://discord.com/channels/775381525260664832/775381525260664836/1021304417779204186) and this design (https://gerrit.openbmc.org/c/openbmc/docs/+/56401)
owners of changes or designs have pinged maintainers quite often; for example, https://discord.com/channels/775381525260664832/867820390406422538/1018972513679708253

Should the technical-oversight-forum discuss this general issue and come up with a solution or guideline for the OpenBMC community?

edtanous commented 1 year ago

The technical-oversight-forum has talked about this issue at length, so if the request is "please talk about this" I think we can say that we already have, and certain proposals have already been tried to make this incrementally better for folks, and if we look at the contributor data, we've made progress on making this better in the last year. We have specifically discussed phosphor-user-manager, and we agree, there's some action that needs to be taken there.

I'm struggling a little bit, because the request above seems like mostly commentary, and isn't proposing something that you would like a decision made on. If you'd like the TOF to make the statement that "things could be better", I think the whole TOF is likely in agreement that someone could certainly improve things, but it's difficult to have the vote you're proposing without a concrete proposal being made. If you have policy changes that you think will make maintainers more likely to respond, please propose them, ideally starting with a review, then if the changes you're proposing are contentious, the TOF is happy to help up/down vote to remove the contention, but as written, I don't see anything requesting an action other than discussion, which anyone is free to start.

Some offhand ideas that I've personally thought about before to try to make this better:

Mono-repo, which makes the escalating review responsibility clear, and when maintainers fail to reply, the per-review escalation paths are clear. This was talked about on the mailing list a while back, it'd be good to get your input on it in that forum if you think it would help. Ironically, it suffered from a lack of community input, IMO, because the people that needed it the most had already forked the project.
Document submitter expectations, with requirements (reviews, activity, ect) for community interaction that maintainers can use to prioritize reviews for those that are actively participating, and discourage or deprioritize "drive by" reviews. A lot of these are already documented, we just don't have a lot of data on when the guidelines are broken.
Use calendar-based merge windows (similar to the Linux kernel), where we make clearer what the review SLO expectations are, and prioritize active patches higher, as well as have a dedicated time for testing and integration. Possibly also adopting a pull request model, to allow long-lived branches to do development async among the group of individuals, then request merges within the window.
Requirements around collaboration of design docs submissions regarding engaging the relevant experts and standards prior to authoring a design doc, to reduce the time a design doc takes to execute. This already was implemented to some extent (https://gerrit.openbmc.org/c/openbmc/docs/+/55831) But unfortunately was ignored in the design doc you linked, so we don't have great data on whether or not that would've reduced the time it took to merge that particular design.

Do you think any of the above would help?

amboar commented 1 year ago

I think the first step is to get consensus on what we think is the maximum latency for feedback across the project. Once we have that in place, it's a small step to accept that (regularly) failing to meet these latency requirements means that there's now a failure of consensus as to who the maintainers are for the repository at hand. One of the roles of the TOF is to handle failed consensus, and I think we can leverage that to define a new process for handling transfer of maintainership.

I think this would be helped by defining different maintenance states for repositories. Linux, for instance, uses the following set of states:

    S: *Status*, one of the following:
       Supported:   Someone is actually paid to look after this.
       Maintained:  Someone actually looks after it.
       Odd Fixes:   It has a maintainer but they don't have time to do
            much other than throw the odd patch in. See below..
       Orphan:  No current maintainer [but maybe you could take the
            role as you write your new code].
       Obsolete:    Old code. Something tagged obsolete generally means
            it has been replaced by a better system and you
            should be using that.

Maybe we don't need all of these. Addressing the issue at hand, a subset as small as { Maintained, Orphan, Obsolete } is probably enough. With the TOF in the picture, (enough, N) feedback deadline misses trigger the transfer of maintainership from the current set of maintainers to the TOF. The TOF does not necessarily maintain the code directly, but exists to accept a new set of maintainers for the repository.

stateDiagram-v2
[*] --> Maintained: Create project repository
Maintained --> Orphan: N feedback deadline misses
Maintained --> Obsolete: Project superseded
Orphan --> Maintained: TOF selects new project maintainer(s)
Orphan --> Obsolete: TOF finds no suitable candidates for project maintenance
Obsolete --> [*]

I don't think it's reasonable that one deadline miss triggers this process, life happens. We'd probably want to come up with some number (N) that people agree substantiates maintainers not performing their role. I also think it's reasonable that if provided with notice of absence (and possibly temporary handover of duties) that the process could be made with more judgement applied than rules. If maintainers are communicating that well then they're unlikely to run into problems anyway. Having more than one maintainer for a project mitigates this problem, for the most part.

Further I suggest this proposal extends to the complete set of maintainers for a repository. Feedback deadlines are only missed if no maintainer for the project provides feedback inside the feedback period. All maintainers are removed in the event of N deadline misses as it must be the case that they are all unresponsive. This feels intuitive, and provides a way maintainers can continue to sort out who's active and who's not among themselves unless they all become inactive, and therefore are a limiting factor for the project by their combined unresponsiveness.

I expect that the TOF would elect people actively pushing patches to become the new maintainers for the project at hand.

williamspatrick commented 1 year ago

We'd probably want to come up with some number (N) that people agree substantiates maintainers not performing their role.

We may want some ratio and/or misses per time-unit. I know even some of the highly active maintainers miss things or have a misunderstanding on next-steps.

Is this something we'd attempt to pull from Gerrit data or via some sort of reporting system?

williamspatrick commented 1 year ago

All maintainers are removed in the event of N deadline misses as it must be the case that they are all unresponsive. This feels intuitive, and provides a way maintainers can continue to sort out who's active and who's not among themselves unless they all become inactive, and therefore are a limiting factor for the project by their combined unresponsiveness.

A minor contrarian side here is that I think some maintainers effectively have their names on things in emeritus status because remaining maintainer wants to deal with the people-side of pulling their name off.

Would we want to view these deadline misses / maintainer removal per-person? As in, if a person becomes inactive they are removed from maintainership everywhere? I say this because my feeling is If they don't intend to support some code they should be more proactive at indicating this, and my experience says that someone who is unresponsive on 1 repository isn't likely to be responsive on a different one (except in a few cases when the responded-to one is related specifically to something only their company cares about).

amboar commented 1 year ago

We may want some ratio and/or misses per time-unit. I know even some of the highly active maintainers miss things or have a misunderstanding on next-steps.

Yeah, definitely. I tend to practice complaint-driven review :sweat_smile: which would probably fall afoul of all this. I'm happy for people to hassle me in DMs or by @-ing me on Discord on reasonable timeframes (a week or two of no feedback).

Is this something we'd attempt to pull from Gerrit data or via some sort of reporting system?

Just to clarify, I think we should have a documented process such as what I suggested, but that it is really a process of last resort. We should work to exhaust all the usual social/communication avenues first and make sure that maintainers at risk cannot claim that it was a surprise. Contributors should try to work with them in a proactive way to step up and take on maintenance responsibility if they feel there's not enough progress, before we get the point of needing to invoke something like this.

I feel like mining gerrit data should only be resorted to if we want to substantiate a (soft) claim that repositories are being neglected. It should be complaint-driven, effectively, not some cron job that acts as a maintenance cop.

As in, if a person becomes inactive they are removed from maintainership everywhere? I say this because my feeling is If they don't intend to support some code they should be more proactive at indicating this, and my experience says that someone who is unresponsive on 1 repository isn't likely to be responsive on a different one (except in a few cases when the responded-to one is related specifically to something only their company cares about).

I think this is probably a bit agressive. If maintainers are being broadly absent this will fall out in the numbers anyway. Let's not get ahead of ourselves :)

FighterNan commented 1 year ago

Great discussion! Right, in this issue, I didn't propose any ideas to solve this. I created this TOR issue to make sure that we not only discuss any solution, but also get it implemented at the end.

From the discussion above, I want to highlight things that I believe that will help.

clearer review SLO expectations or maximum latency for feedback
deadline misses / maintainer removal per-person

bradbishop commented 1 year ago

I think this is probably a bit agressive. If maintainers are being broadly absent this will fall out in the numbers anyway. Let's not get ahead of ourselves :)

👍

FighterNan commented 1 year ago

Any updates or decisions made in the TOF meeting?

williamspatrick commented 1 year ago

Any updates or decisions made in the TOF meeting?

There wasn’t anything actionable yet. I think someone took an action to refine the proposal from @amboar.

amboar commented 1 year ago

Right, I'll push up a doc, but I haven't had time to write it just yet.

amboar commented 1 year ago

I've pushed a proposal here: https://gerrit.openbmc.org/c/openbmc/docs/+/58653

Having thought through the problem, spoken to some people and considered the trade-offs, what I've proposed doesn't take the path of removing unresponsive maintainers like I outlined above. Instead, it just describes a process for on-boarding new maintainers.

williamspatrick commented 1 year ago

I think this can be closed with the action taken being the merge of @amboar's propsal?

openbmc / technical-oversight-forum

Unresponsive maintainers and unclear code review SLO #20