Make kebechet respond to release tickets on all failures

tumido commented 4 years ago

Is your feature request related to a problem? Please describe.
Releasing via kebechet is very convenient and straightforward when it works. When it doesn't and it's not related to permissions (user is not a maintainer and such) it's really hard to triage.

Describe the solution you'd like
Get an error message if any of the build/release steps fails.

Describe alternatives you've considered
n/a

Additional context
https://github.com/aicoe-aiops/categorical-encoding/issues/16

saisankargochhayat commented 4 years ago

We already mention in the release issue when the person trying to create the release is not a maintainer. For example: https://github.com/thoth-station/storages/issues/2109

goern commented 3 years ago

@tumido are you good with this behavior? can we close this issue?

tumido commented 3 years ago

@goern please don't close, I don't think we understand each other here :slightly_smiling_face:

@saisankargochhayat yeah, that's true and that's precisely why I've excluded those cases in the description, see:

> ... and it's not related to permissions (user is not a maintainer and such)

In our case the issue was hard to triage because Kebechet failed to push the tag, since it was already released outside of Kebechet via git tag and the tag already existed, while the version in version.py was outdated (didn't match the tag). See:
https://github.com/aicoe-aiops/categorical-encoding/issues/15
https://github.com/aicoe-aiops/categorical-encoding/issues/16
https://github.com/aicoe-aiops/categorical-encoding/issues/17
And finally solved here: https://github.com/aicoe-aiops/categorical-encoding/issues/19

As you can see, we were very much in the dark about what was happening, and kebechet didn't tell us why it failed. This ticket is precisely for such occasions of anticipated failures, not about this exact failure type. My ask here is whether we can make kebechet report the status every time, in any failure case.

saisankargochhayat commented 3 years ago

From what I understand, in a scenario like this a comment on the release issue is what we want - https://github.com/aicoe-aiops/categorical-encoding/issues/16#issuecomment-735950590 Is that correct?

tumido commented 3 years ago

Well, I don't think we should pay attention to this particular cause. It's not about fancy reporting on one specific, narrow reason for failure. This should be about plain old-school reporting for any kind of failure.

If the bot can provide any insight into what happened, it will mean the bot saved us from filing 3 more triage trial issues. A link to the job run, failed steps, log of the step, whatever - that's what I'd like to see, the "debug" data.
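A minimal sketch of this kind of catch-all reporting, assuming the release is a sequence of named steps. `run_steps` and the `post_comment` callable are hypothetical names used for illustration only; they are not Kebechet's actual API, and in a real bot `post_comment` would have to post to the release issue.

```python
import traceback
from typing import Callable, Iterable, List, Optional, Tuple

# A step is a (name, callable) pair; both are illustrative assumptions.
Step = Tuple[str, Callable[[], None]]


def run_steps(steps: Iterable[Step], post_comment: Callable[[str], None]) -> Optional[str]:
    """Run steps in order; on the first failure, report it and return the step name."""
    for name, step in steps:
        try:
            step()
        except Exception:
            # Report any failure, not only the ones we anticipated.
            body = (
                f"Release failed at step `{name}`.\n\n"
                "Traceback:\n\n"
                f"{traceback.format_exc()}"
            )
            post_comment(body)
            return name
    return None


def push_tag() -> None:
    # Stand-in for the failure described above: the tag already exists.
    raise RuntimeError("tag v1.0.0 already exists")


comments: List[str] = []  # collects comment bodies instead of calling GitHub
failed = run_steps(
    [("bump version", lambda: None), ("push tag", push_tag)],
    comments.append,
)
```

Here `comments` ends up holding a single body that names the failed step and carries the traceback, which is the "debug" data asked for above.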

saisankargochhayat commented 3 years ago

So as a general principle, for any exception we encounter, we do post an issue comment notifying the user. This seems like a corner case, where the release was created manually instead of through kebechet, which messed things up. But feel free to let me know if you find any other place where reporting the error would be helpful to the user. Source code link - https://github.com/thoth-station/kebechet/blob/master/kebechet/managers/version/version.py

Maybe it's a good idea to state in the version manager's documentation that whenever you release manually, you should ensure that the source code version string and the release tag both indicate the same version.
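The consistency rule suggested here could also be checked mechanically before a release is attempted. The following is only a sketch under assumptions: `normalize` and `describe_mismatch` are hypothetical helpers, the caller is assumed to already know the latest tag, and tags are assumed to differ from the version string by at most a leading `v`.

```python
from typing import Optional


def normalize(version: str) -> str:
    """Drop a leading 'v' so that 'v1.2.0' and '1.2.0' compare equal."""
    return version[1:] if version.startswith("v") else version


def describe_mismatch(source_version: str, latest_tag: str) -> Optional[str]:
    """Return a message if the source version and the latest tag disagree, else None."""
    if normalize(source_version) == normalize(latest_tag):
        return None
    return (
        f"Version string in source is {source_version!r} "
        f"but the latest tag is {latest_tag!r}; "
        "align them before requesting a release."
    )


# Example: version.py was left at 1.1.0 while v1.2.0 was tagged manually.
message = describe_mismatch("1.1.0", "v1.2.0")
```

A message like this, posted to the release issue, would point directly at the situation described earlier in the thread.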

tumido commented 3 years ago

I think a more generic failure handling would be appreciated on the user side.

Right now I'm debugging another issue, with a different repo where the release process failed silently (the release PR has been opened, the git tag was pushed, yet no image was delivered to quay). There's no message from any of the bots on any of the issues (sesheta even closed the release issue as a success). See for yourself:

https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/24

https://quay.io/repository/aicoe/mailing-list-analysis-toolkit?tab=tags

I wasn't able to locate the Tekton pipeline responsible for that release, so I've triggered the "Deliver container image" issue pipeline for the missing image: https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/27

The build failed on some networking error (now I know that it was a networking issue, since I was able to locate the Tekton job): https://tekton-dashboard-openshift-pipelines.apps.ocp4.prod.psi.redhat.com/#/namespaces/aicoe-infra-prod/pipelineruns/aicoe-issue-qdx7b

Yet the bots are still silent on the issues. This is not about a single corner case. This is more about a generic "safety measure", e.g. if I face any error, I report it. Can we make AICoE-CI do that please?

cc @harshad16

harshad16 commented 3 years ago

@tumido thanks for pointing it out. On the aicoe-ci side, we are trying to get this message to the user, either on the PR or on the opened issue. Some changes are still needed before it is convenient for the user to get more information. We will try to get these details to the user.

On the topic of kebechet, a feature that could be useful is responding on the issue with the reason it went stale. The reason is that the pod running the kebechet run has failed, and because it failed, no message is relayed all the way to the GitHub issue. We should plan to handle this, either by reporting the error traceback to the GitHub issue or PR (for that we would have to monitor the exceptions), or via a sidecar container which responds on the GitHub issue with the log of the failed main container.
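As a rough illustration of the sidecar idea, the part that turns a failed container's exit code and log into a comment body might look like the sketch below. How the sidecar obtains the exit code and log, and how it posts to GitHub, is deliberately left out; `failure_comment` and its parameters are assumptions made for this example.

```python
from typing import Optional


def failure_comment(exit_code: int, log_text: str, max_lines: int = 20) -> Optional[str]:
    """Build a comment body from the tail of the log, or None if the run succeeded."""
    if exit_code == 0:
        return None
    lines = log_text.splitlines()
    # Keep only the last max_lines lines; the end of the log usually holds the error.
    tail = lines[-max_lines:] if max_lines > 0 else []
    quoted = "\n".join(f"    {line}" for line in tail)
    return (
        f"The run exited with code {exit_code}. "
        f"Last {len(tail)} log lines:\n\n{quoted}"
    )


body = failure_comment(1, "line-1\nline-2\nline-3", max_lines=2)
```

Because the function only needs an exit code and text, the same formatting could be reused whether the data comes from a sidecar or from exception monitoring inside the main container.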

tumido commented 3 years ago

@harshad16 I know it's hard to catch every possibility and I know aicoe-ci is doing its best and I'm totally rooting for you! Yet we're still pushing the limits and demands and opening new issues... :smile:

The sidecar container or something sounds like a wonderful idea (it also sounds like a lot of work)! Looking forward to the bright future :+1:

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

goern commented 3 years ago

/remove-lifecycle stale

@harshad16 what is the status of this?

tumido commented 3 years ago

So... Since I always wanted to learn the CI ropes, I've experimented with this when building my own CI for the OperateFirst slack bot...

I think a comment like this from the bots would be enough: https://github.com/tumido/slack-first/issues/54#issuecomment-823598881

I'm updating the same comment in various stages of the CI with the most recent actions taken. It helps me understand which workflow and at which step it got stuck.

If it would be possible to have something like this for AICoE-CI, I think it would be a huge jump forward in usability.
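A small sketch of the "one comment, updated at every stage" approach, limited to rendering the comment text. The stage names, the state labels, and `render_status` are illustrative assumptions; actually editing the existing comment on the issue is not shown here.

```python
from typing import Dict, List

# Plain-text markers for each state; the set of states is an assumption.
ICONS = {"pending": "[ ]", "running": "[~]", "done": "[x]", "failed": "[!]"}


def render_status(stages: List[str], states: Dict[str, str]) -> str:
    """Render the full status comment; stages without a state count as pending."""
    lines = ["Release status:"]
    for stage in stages:
        state = states.get(stage, "pending")
        lines.append(f"{ICONS[state]} {stage} ({state})")
    return "\n".join(lines)


stages = ["build", "push image", "close issue"]
# Re-render after every stage change and overwrite the same comment with the result.
status = render_status(stages, {"build": "done", "push image": "failed"})
```

Rendering the whole comment from the current state each time keeps the update idempotent: the latest render always shows which stage the run reached and where it got stuck.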

goern commented 3 years ago

I'm all in for more chatops, as long as we keep it accessible to us Red Hats using Google Chat ;)

Shall we send out events from the CI to a Kafka topic and have different consumers send messages to slack or gchat?

goern commented 3 years ago

/priority backlog

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

goern commented 3 years ago

/remove-lifecycle rotten
/help
/good-first-issue

sesheta commented 3 years ago

@goern: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/thoth-station/kebechet/issues/629):

>/remove-lifecycle rotten
>/help
>/good-first-issue

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sesheta commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

goern commented 3 years ago

/remove-lifecycle rotten

goern commented 3 years ago

/assign goern

goern commented 3 years ago

/sig user-experience

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

harshad16 commented 2 years ago

/lifecycle frozen

thoth-station / kebechet

Make kebechet respond to release tickets on all failures #629