jenkinsci / jms-messaging-plugin

https://plugins.jenkins.io/jms-messaging
12 stars 37 forks source link

fedmsg - some messages are not registered by the plugin #206

Closed vojtechsokol closed 3 years ago

vojtechsokol commented 3 years ago

Hi,

our team uses JMS plugin to consume fedmsg messages with org.fedoraproject.prod.github.* topics (forwarded to fedmsg by github2fedmsg app).

Everything works fine, but after a while no jobs are triggered by the plugin. One hour after the last message was received, the plugin resubscribes the job to the topic and it works again for a while.

When this happens, Jenkins log contains these entries but no errors:

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker receive
No message received for the past 3600000 ms, re-subscribing for job 'fedmsg-github-opened'.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker unsubscribe
Un-subscribing job 'fedmsg-github-opened' from org.fedoraproject.prod.github.pull_request.opened topic.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker subscribe
Subscribing job 'fedmsg-github-opened' to org.fedoraproject.prod.github.pull_request.opened topic.

INFO com.redhat.jenkins.plugins.ci.messaging.FedMsgMessagingWorker subscribe
Successfully subscribed job 'fedmsg-github-opened' to topic 'org.fedoraproject.prod.github.pull_request.opened'.
ggallen commented 3 years ago

@vojtechsokol , this is normal behavior for the plugin.

Can I get access to the Jenkins instance to take a look?

vojtechsokol commented 3 years ago

Thank you for quick reply @ggallen By 'normal behavior' you mean that it's expected to miss some messages? The jenkins instance with affected job is here, you should be able to login there using your kerberos ticket.

ggallen commented 3 years ago

@vojtechsokol , no it shouldn't miss messages. I mean that resubscribing after 1 hour of no messages is normal, and those are the messages you are seeing in the log. That behavior and those messages are expected.

Let me take a look.

ggallen commented 3 years ago

@vojtechsokol , can you give me an example of a message that was handled and one that was missed? I don't see anything obvious yet.

vojtechsokol commented 3 years ago

@ggallen I've created short script that dumps builds of the Jenkins job and messages from datagrepper: https://gist.github.com/vojtechsokol/e427c65d541ac294d3170df7a133ac9c. When you compare the resulting files (builds-pull-only.txt and messages.txt), you can see that there are a lot of builds missing (ATM more than 50%), the job was not triggered for e.g. these pull requests (lines are in the format owner:repo:pull_nr:commit_sha):

containers:podman:7578:cf14140fee185a4441acad148a971dbeabe52076
coreos:coreos-assembler:1706:af074e097bb5b859a4fea0bf12b0212f5ee3c004
operator-framework:community-operators:2303:8181d6944e2083fe35c71d915d5a27f45db27c6a
operator-framework:community-operators:2302:9cf7ae072d4ef423147a7010ade95d72a0fa8614
operator-framework:community-operators:2301:2830757fe5022059357e7a3d28a462f2ae59761d
ralphbean:bugwarrior:749:025f76ca31e72acf3426e06501576feca055f53a
mesonbuild:meson:7716:01b9f0af08fb813f68cec6edd795f3488a78728c
containers:podman:7577:b64f93f9c5cb4fc1870d7b830b2a23a4390c6940
ceph:ceph:37072:d6d73a649d89e44890f925278433e9ab55ffa68c
ggallen commented 3 years ago

@vojtechsokol , information overload!

All I want it a job, a message that triggered, and a message that didn't. I don't see how to find the actual messages in all the data the script produces.

vojtechsokol commented 3 years ago

@ggallen examples of correctly triggered builds and corresponding messages: message: https://apps.fedoraproject.org/datagrepper/id?id=2020-7a1b1511-ed51-4657-93be-b2dc86cc25e0&size=extra-large Jenkins job build: /job/ci-dispatch-github_pull_request_opened/2186

examples of messages that didn't trigger build message: https://apps.fedoraproject.org/datagrepper/id?id=2020-036dd572-b42f-422c-a714-92030c33a7d4&size=extra-large Jenkins job build: /job/ci-dispatch-github_pull_request_opened/2185

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b572e50a-857a-4ef1-8bec-d9dfb75b3f89&size=extra-large Jenkins job build: /job/ci-dispatch-github_pull_request_opened/2184

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b5efe6bc-7e0b-4742-a726-3579362bf303&size=extra-large Jenkins job build: /job/ci-dispatch-github_pull_request_opened/2183

message: https://apps.fedoraproject.org/datagrepper/id?id=2020-84cf66e9-4953-4096-be06-cc7c46f75694&size=extra-large Jenkins job build: /job/ci-dispatch-github_pull_request_opened/2182

https://apps.fedoraproject.org/datagrepper/id?id=2020-08e6eece-4ed6-4a8c-9a06-3a49c28ca3f2&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-b82632dc-22ec-4a50-9104-6f6a64aa510a&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-fbc45335-409b-4656-99a3-7e517e48681c&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-2cc57b49-ebc2-481a-83be-15baa49f81b2&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-a5c9491b-2bad-4866-962a-939c3643a01c&size=extra-large

If the jenkins builds are not available any more due to retention policy, you can run the updated script, just don't forget to provide the jenkins url in the script.

ggallen commented 3 years ago

@vojtechsokol , I just checked your jms-messaging-plugin version and you are running 1.1.9. The current version is 1.1.18. Could you please update to that version and see if things improve? 1.9 was released over a year ago, and there have been a lot of changes since then.

vojtechsokol commented 3 years ago

@ggallen After updating the plugin to 1.1.18 it still misses some messages. builds and triggering messages: https://apps.fedoraproject.org/datagrepper/id?id=2020-b2db8982-2c47-4dfb-8f2f-8c0ea21b4a13&size=extra-large

/job/ci-dispatch-github_pull_request_opened/2217 https://apps.fedoraproject.org/datagrepper/id?id=2020-b1d06e1e-1984-4962-b3e7-0a739b3c72ee&size=extra-large /job/ci-dispatch-github_pull_request_opened/2216 https://apps.fedoraproject.org/datagrepper/id?id=2020-2d2003b7-e6be-4815-9829-960f124a533d&size=extra-large /job/ci-dispatch-github_pull_request_opened/2215 missed messages: https://apps.fedoraproject.org/datagrepper/id?id=2020-9c195ef1-e948-4cc7-92de-eb933622d731&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-a2dabf3a-be6b-4cbc-9120-86617aa03c1c&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-5c08e2d0-4608-4fa5-8474-7e7acaa90cbe&size=extra-large
thrix commented 3 years ago

@ggallen should we provide more information here pls? This is becoming a blocker for migrating to these new jobs which use github -> fedsmg as triggers ...

ggallen commented 3 years ago

There's nothing obvious I can see here. This is going to require some more in-depth investigation.

Can you establish any sort of a pattern for when messages are missed vs. when they are received?

@scoheb , any thoughts?

scoheb commented 3 years ago

@vojtechsokol contact me offline to discuss.

scoheb commented 3 years ago

actually can you please check the status of the webhook deliveries at https://github.com/ceph/ceph/settings/hooks for https://apps.fedoraproject.org/github2fedmsg/webhook ?

It is possible there may be some 404 errors.

There might be some continual data centre migration happening.

scoheb commented 3 years ago

It does seem that there may be an issue with your missed messages script. I found that build 2520, 2519, 2518 were all triggered but do not show up as matched in output.

vojtechsokol commented 3 years ago

@scoheb Indeed, there was a bug in the script, It should work properly now.

Regarding the status of the webhook - I have access only to the oamg/leapp and oam/leapp-repository and the are two failures (500 Internal Server Error) in last 150 deliveries - however I don't think it relates to our problem - those events didn't even make it to the fedmsg.

@ggallen No idea at all if there is any pattern. The only hint I have is that maybe it affects more topics with smaller volume of messages - job ci-dispatch-github_pull_request_opened that consumes messages with topic org.fedoraproject.prod.github.pull_request.opened (722 messages in last 10 days) is affected more than job ci-dispatch-github_pull_request_commit that consumes messages with topic org.fedoraproject.prod.github.pull_request.synchronize (1213 messages in last 10 days).

scoheb commented 3 years ago

@vojtechsokol @thrix @ggallen

Please see https://pagure.io/fedora-infrastructure/issue/9363

scoheb commented 3 years ago

@vojtechsokol @thrix

The problem was traced to some proxies behind the fedora hub that were not working correctly. This caused the issues you were seeing since it all depended on which of the 10 proxies you connected to in Jenkins when the job subscribed to a topic. Only 2 of the 10 had been working correctly.

This has since been fixed on the fedora infra side. All proxies are working well now.

Please see https://pagure.io/fedora-infrastructure/issue/9363

thrix commented 3 years ago

@scoheb thanks a lot for hunting this down!

vojtechsokol commented 3 years ago

@scoheb \o/ Thanks for helping to fix that!