kubernetes / community

Kubernetes community content
Apache License 2.0
11.9k stars 5.15k forks source link

reconsider posting CI comments on every test failure #3621

Open BenTheElder opened 5 years ago

BenTheElder commented 5 years ago

Currently Kubernetes PRs get a lot of comments from @k8s-ci-robot, many of which are to notify of failing tests. This tends to drown out actual human interactions.

I know this came up at some point in the past, but I'm having difficulty finding why we do this and I'd like to revisit it.

Right now prow deletes it's job-results comment every time results change, and posts a new comment if there are any failures. This can be very spammy (see below), and punishes reviewers and approvers for being subscribed to PRs with tons of excess notifications.

Most (all?) other CI on GitHub merely post to the "status" or "checks" APIs (https://help.github.com/en/articles/about-status-checks) which Prow / @k8s-ci-robot also does. We should consider only doing this instead of posting comments on every failure to reduce pressure on developers's notifications.

Additionally, the comments can be misleading with stale state in them https://github.com/kubernetes/test-infra/issues/4602

If we agree that this is reasonable, it should be relatively straightforward to make it possible in test-infra.

BenTheElder commented 5 years ago

/sig contributor-experience /sig testing

Example: EDIT: note that after a page refresh most of these comments will go away (the bot deletes them), they will not however leave your inbox :-)

image

dims commented 5 years ago

yes please +1 to test drive this in test-infra.

guineveresaenger commented 5 years ago

+1 if only to get folks off my back that it's the welcome message that's at fault 😝

cblecker commented 5 years ago

A few thoughts:

BenTheElder commented 5 years ago

This will impact folks who have a primarily e-mail driven workflow. It may impact notification driven workflows too, as I'm not sure if a context update triggers the notification bell, even if not an e-mail.

Those workflows are likely broken to begin with, as we do not post a comment when tests pass. You can only know that they failed at some point.

The example is a bit misleading, as you'll only get a screen like that if you don't refresh the GitHub window. The previous comments would get deleted/edited so that there aren't duplicate comments from the bot.

You get a notification for each of these comments because the comments are deleted and a new one posted, which was the intention of the example. Refreshing and seeing less comments visibly is correct though, but you were still notified for each of these.

The one part that I personally really like about the bot's comments is the clear notification of "this is what failed, on which commit, here's a link to the log, and here's the command to re-run". I realize that the "what failed" and "link to log" would still be available in the status contexts section, but you don't get the clarity of which commit the posted status is for, or the retest command in the status contexts.

Breaking that down..

this is what failed

This is in the status contexts

on which commit

You should be able to see the commit in the link, or we could put it in the description.

here's a link to the log

Log links are in the status contexts.

and here's the command to re-run".

We post a welcome comment with instructions for the other commands. Telling users how to rerun something should not require a comment for every failure.

Note that the comment table values are also often misleading with stale results.

The status context section doesn't appear on mobile unless you have direct write access to a repo, so reviewing test results on mobile would be much move difficult (I will grant this is likely an edge case).

Desktop view works on mobile, github even suggests it now at the bottom of the page.

vllry commented 5 years ago

Personally I get some use from the notifications - IMHO it would be nice to get a single comment, such as "1 test failed and more an ongoing" without comment followup, or "here's the complete status report". But I'd rather axe them all than keep getting hammered by prow as-is. :|

cblecker commented 5 years ago

We post a welcome comment with instructions for the other commands. Telling users how to rerun something should not require a comment for every failure.

Note that the comment table values are also often misleading with stale results.

We post the specific tests, and the specific retest comments (e.g. /test pull-kubernetes-verify). This make a copy paste really easy and clear.

The status context section doesn't appear on mobile unless you have direct write access to a repo, so reviewing test results on mobile would be much move difficult (I will grant this is likely an edge case).

Desktop view works on mobile, github even suggests it now at the bottom of the page.

It's still a valid criticism of doing away with the comment table. The desktop view isn't great on mobile. As I said previously, it's an edge case.. but it's still a valid one.

nikhita commented 5 years ago

Related https://github.com/kubernetes/test-infra/issues/12488#issuecomment-489797420

BenTheElder commented 5 years ago

We post the specific tests, and the specific retest comments (e.g. /test pull-kubernetes-verify). This make a copy paste really easy and clear.

IMHO this is still exactly what the welcome bot should be for, or we can put the copy paste command on the spyglass page. This does not have to be a repetitive comment.

It's still a valid criticism of doing away with the comment table. The desktop view isn't great on mobile. As I said previously, it's an edge case.. but it's still a valid one.

We have friends at GitHub, IMHO we should not "fix" their mobile UX by spamming inboxes 😞

BenTheElder commented 5 years ago

I would also like to point out that roughly "GitHub notifications are a tire fire" is frequent complaint of Kubernetes project members, such that many of them simply don't use GitHub notifications in any form. I think this is detrimental to the project.

More specifically Prow contributes a very large number of those comments in ways that would be unusual in other projects, a major one being commenting after every individual test failure which notifies everyone on the PR.

On nearly any other project I would just check the status contexts if I'm concerned with the test results, without it sending any comments to the whole thread.

I've personally resorted to specifically blocking Prow comments to the extent possible but I think this UX is bad for everyone. https://github.com/kubernetes/test-infra/issues/12488#issuecomment-493117263

We've also found that even highly experienced contributors don't notice the retest commands in these comments. I would hazard that this is because they already mentally block out these comments as spam. https://kubernetes.slack.com/archives/C09QZ4DQB/p1557951304418200

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

nikhita commented 4 years ago

/remove-lifecycle stale

cblecker commented 4 years ago

asking some questions about this on the contributor survey https://github.com/kubernetes/community/issues/3969

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

nikhita commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

cblecker commented 4 years ago

/remove-lifecycle rotten

spiffxp commented 4 years ago

FWIW, I find spyglass' PR history page to be really useful for keeping track of what ran / passed when on my PR. I most often get there via a link in the "tests failed" bot comment.

eg: https://github.com/kubernetes/kubernetes/pull/93549#issuecomment-666568900 -> "Full PR test history" -> https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=93549

spiffxp commented 4 years ago

The other way I know to get there:

left = more recent

cblecker commented 4 years ago

Sorry for the delay in coming back to this issue. We ran questions on this in the 2019 Contributor Survey, and they were posted back in March -- but with the state of the world, this dropped off my radar.

In reviewing the data we were able to glean from the survey, while it's clear that there are those in the community that see these notifications as a problem, the majority of the community do not see the numbers of notifications as a problem. There is still possible room to improve the UX here as far as language, but the data doesn't show that removing notifications and making failures less discoverable, will be helpful.

Applicable data: https://github.com/kubernetes/community/tree/master/sig-contributor-experience/surveys/2019#agreement-with-statements

I'm open to other thoughts, or if others see slightly different conclusions based on the data. I appreciate everyone's patience though as we attempt to make some data driven decisions :)

sftim commented 4 years ago

SIG Docs uses Prow, but right now that's almost entirely for enforcing process rather than for testing the change itself. From a SIG Docs perspective I'd welcome a GitHub Checks status that lists all the reasons why a PR can't merge, perhaps with a link to likely actions that would change that for each one. Similar to “Not mergeable. Needs approved label.” but with much more detail.

If we could in future have Prow report on warnings, those could show up in the Checks tab as well. An example of a warning that's relevant to SIG Docs is when a PR has multiple language labels. That's not a merge blocker but it's a sign for approvers to be cautious.

BenTheElder commented 4 years ago

Minor note: The Checks API is blocked on https://github.com/kubernetes/test-infra/issues/10423

Edit: I don't have the energy to continue debating this issue right now. But I will supply purely technical information that may not be well known.

mrbobbytables commented 3 years ago

looping in sig-usability to see if they have any suggestions /sig usability

alvaroaleman commented 3 years ago

While the question of posting a comment on every failure is pretty controversial, I was wondering what ppl think about "Only include jobs that ran against the latest commit" in the comment. Today the comment might contain tests against multiple versions of a PR which confuses a lot of ppl (including me). Does anyone have thoughts on that?

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

kwiesmueller commented 3 years ago

I just got pointed to this after asking if notifications could be reduced which is especially hard to work with since the GitHub mobile app. I'd love to see one comment that gets updated instead of a new mail/notification for every new failure during a single run. On the other hand I don't know other peoples mail based workflows. I regularly jump to test results myself from mail, but in most cases I land on GitHub in the end anyways.

It should not be necessary to disable notifications for GitHub just to stay sane working on any single project. But to me comments are also primarily for communicating to people and the Jobs overview provided by GitHub on the bottom is mostly enough for me.

spiffxp commented 3 years ago

Another related issue over this (it aged out) https://github.com/kubernetes/test-infra/issues/1723

It mentioned https://github.com/kubernetes/test-infra/pull/11341 was an attempt to address? But was reverted in https://github.com/kubernetes/test-infra/pull/11472 due to "lack of consensus"

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

BenTheElder commented 3 years ago

/remove-lifecycle rotten what do we think is necessary to achieve a consensus on this? do we need to send out a survey? to who? do I need to go to a contribex meeting? right now this is blocked on an arbitrary undefined bar https://github.com/kubernetes/community/issues/3621#issuecomment-771112667

cc @kubernetes/sig-contributor-experience @kubernetes/sig-testing

BenTheElder commented 3 years ago

these comments are frequently confusingly out of date, e.g:

Screen Shot 2021-03-04 at 4 10 20 PM

they're easily de-sychronized with the actual state in the github built-in status feature, which we also use.

BenTheElder commented 3 years ago

Applicable data: https://github.com/kubernetes/community/tree/master/sig-contributor-experience/surveys/2019#agreement-with-statements

re-looking at this data, I don't think there's a question clearly about test failure notifications. the question asks if the notifications are helpful without differentiating the other classes of comment we post

alvaroaleman commented 3 years ago

At the very least what I would really like to see is not keeping references to jobs that ran on outdated revisions in the comment, that is tremendously confusing: https://github.com/kubernetes/community/issues/3621#issuecomment-708551605

BenTheElder commented 3 years ago

+1, there's a similar problem with removing jobs and even though the github statuses go away correctly the failure comment will keep existing failures forever if there is no pass.

EDIT: we're likely to have many of these issues as long as we report jobs to contributors in two distinct ways, one of which involves an API we can do on the fly per-job in a reliable fashion (given an actual github API for this), the other involves racily editing a comment to try to bring it closer in-line with the status we're reporting. The comment system is much more brittle.

LappleApple commented 3 years ago

A few months ago a UX designer was looking at this issue -- did anything come of that in Contribex?

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

mrbobbytables commented 3 years ago

/remove-lifecycle rotten /lifecycle frozen

MadhavJivrajani commented 2 years ago

/remove-kind design /kind feature kind/design is migrated to kind/feature, see https://github.com/kubernetes/community/issues/6144 for more details

mrbobbytables commented 1 year ago

/assign @MadhavJivrajani

jberkus commented 1 year ago

@MadhavJivrajani is this issue still relevant? If not, can you close it? Thanks.