Closed vojtechsokol closed 3 years ago
@vojtechsokol , this is normal behavior for the plugin.
Can I get access to the Jenkins instance to take a look?
Thank you for quick reply @ggallen By 'normal behavior' you mean that it's expected to miss some messages? The jenkins instance with affected job is here, you should be able to login there using your kerberos ticket.
@vojtechsokol , no it shouldn't miss messages. I mean that resubscribing after 1 hour of no messages is normal, and those are the messages you are seeing in the log. That behavior and those messages are expected.
Let me take a look.
@vojtechsokol , can you give me an example of a message that was handled and one that was missed? I don't see anything obvious yet.
@ggallen I've created short script that dumps builds of the Jenkins job and messages from datagrepper: https://gist.github.com/vojtechsokol/e427c65d541ac294d3170df7a133ac9c. When you compare the resulting files (builds-pull-only.txt
and messages.txt
), you can see that there are a lot of builds missing (ATM more than 50%), the job was not triggered for e.g. these pull requests (lines are in the format owner:repo:pull_nr:commit_sha
):
containers:podman:7578:cf14140fee185a4441acad148a971dbeabe52076
coreos:coreos-assembler:1706:af074e097bb5b859a4fea0bf12b0212f5ee3c004
operator-framework:community-operators:2303:8181d6944e2083fe35c71d915d5a27f45db27c6a
operator-framework:community-operators:2302:9cf7ae072d4ef423147a7010ade95d72a0fa8614
operator-framework:community-operators:2301:2830757fe5022059357e7a3d28a462f2ae59761d
ralphbean:bugwarrior:749:025f76ca31e72acf3426e06501576feca055f53a
mesonbuild:meson:7716:01b9f0af08fb813f68cec6edd795f3488a78728c
containers:podman:7577:b64f93f9c5cb4fc1870d7b830b2a23a4390c6940
ceph:ceph:37072:d6d73a649d89e44890f925278433e9ab55ffa68c
@vojtechsokol , information overload!
All I want it a job, a message that triggered, and a message that didn't. I don't see how to find the actual messages in all the data the script produces.
@ggallen
examples of correctly triggered builds and corresponding messages:
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-7a1b1511-ed51-4657-93be-b2dc86cc25e0&size=extra-large
Jenkins job build:
examples of messages that didn't trigger build
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-036dd572-b42f-422c-a714-92030c33a7d4&size=extra-large
Jenkins job build:
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b572e50a-857a-4ef1-8bec-d9dfb75b3f89&size=extra-large
Jenkins job build:
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-b5efe6bc-7e0b-4742-a726-3579362bf303&size=extra-large
Jenkins job build:
message: https://apps.fedoraproject.org/datagrepper/id?id=2020-84cf66e9-4953-4096-be06-cc7c46f75694&size=extra-large
Jenkins job build:
https://apps.fedoraproject.org/datagrepper/id?id=2020-08e6eece-4ed6-4a8c-9a06-3a49c28ca3f2&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-b82632dc-22ec-4a50-9104-6f6a64aa510a&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-fbc45335-409b-4656-99a3-7e517e48681c&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-2cc57b49-ebc2-481a-83be-15baa49f81b2&size=extra-large https://apps.fedoraproject.org/datagrepper/id?id=2020-a5c9491b-2bad-4866-962a-939c3643a01c&size=extra-large
If the jenkins builds are not available any more due to retention policy, you can run the updated script, just don't forget to provide the jenkins url in the script.
@vojtechsokol , I just checked your jms-messaging-plugin version and you are running 1.1.9. The current version is 1.1.18. Could you please update to that version and see if things improve? 1.9 was released over a year ago, and there have been a lot of changes since then.
@ggallen After updating the plugin to 1.1.18 it still misses some messages. builds and triggering messages: https://apps.fedoraproject.org/datagrepper/id?id=2020-b2db8982-2c47-4dfb-8f2f-8c0ea21b4a13&size=extra-large
@ggallen should we provide more information here pls? This is becoming a blocker for migrating to these new jobs which use github -> fedsmg as triggers ...
There's nothing obvious I can see here. This is going to require some more in-depth investigation.
Can you establish any sort of a pattern for when messages are missed vs. when they are received?
@scoheb , any thoughts?
@vojtechsokol contact me offline to discuss.
actually can you please check the status of the webhook deliveries at https://github.com/ceph/ceph/settings/hooks for https://apps.fedoraproject.org/github2fedmsg/webhook ?
It is possible there may be some 404 errors.
There might be some continual data centre migration happening.
It does seem that there may be an issue with your missed messages script. I found that build 2520, 2519, 2518 were all triggered but do not show up as matched in output.
@scoheb Indeed, there was a bug in the script, It should work properly now.
Regarding the status of the webhook - I have access only to the oamg/leapp and oam/leapp-repository and the are two failures (500 Internal Server Error) in last 150 deliveries - however I don't think it relates to our problem - those events didn't even make it to the fedmsg.
@ggallen No idea at all if there is any pattern. The only hint I have is that maybe it affects more topics with smaller volume of messages - job ci-dispatch-github_pull_request_opened
that consumes messages with topic org.fedoraproject.prod.github.pull_request.opened
(722 messages in last 10 days) is affected more than job ci-dispatch-github_pull_request_commit
that consumes messages with topic org.fedoraproject.prod.github.pull_request.synchronize
(1213 messages in last 10 days).
@vojtechsokol @thrix @ggallen
Please see https://pagure.io/fedora-infrastructure/issue/9363
@vojtechsokol @thrix
The problem was traced to some proxies behind the fedora hub that were not working correctly. This caused the issues you were seeing since it all depended on which of the 10 proxies you connected to in Jenkins when the job subscribed to a topic. Only 2 of the 10 had been working correctly.
This has since been fixed on the fedora infra side. All proxies are working well now.
Please see https://pagure.io/fedora-infrastructure/issue/9363
@scoheb thanks a lot for hunting this down!
@scoheb \o/ Thanks for helping to fix that!
Hi,
our team uses JMS plugin to consume fedmsg messages with
org.fedoraproject.prod.github.*
topics (forwarded to fedmsg by github2fedmsg app).Everything works fine, but after a while no jobs are triggered by the plugin. One hour after the last message was received, the plugin resubscribes the job to the topic and it works again for a while.
When this happens, Jenkins log contains these entries but no errors: