tcplugins / tcWebHooks

WebHooks plugin for Teamcity. Supports many build states and payload formats.
https://netwolfuk.wordpress.com/category/teamcity/tcplugins/tcwebhooks/
154 stars 30 forks source link

Webhooks configured via Kotlin sees 701 :: WebHook not enabled for buildState 'buildFinished' #248

Open radiumsurfer opened 1 month ago

radiumsurfer commented 1 month ago

Hi @netwolfuk I'm seeing a similar problem that may be related to https://github.com/tcplugins/tcWebHooks/issues/246

We're using kotlin for versioned settings, and (unless there's a better way, ideally IN kotlin would be great) configuring webhooks via configuring the file

Example Configuration (xml)

.teamcity/pluginData//plugin-settings.xml

<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <webhooks enabled="true">
        <webhook url="http://<hook receiving endpoint>" enabled="true" format="json" template="none">
            <states>
                <state type="buildInterrupted" enabled="true" />
                <state type="buildFailed" enabled="true" />
                <state type="buildBroken" enabled="false" />
                <state type="responsibilityChanged" enabled="false" />
                <state type="buildSuccessful" enabled="true" />
                <state type="buildStarted" enabled="false" />
                <state type="beforeBuildFinish" enabled="false" />
                <state type="buildFixed" enabled="false" />
                <state type="buildFinished" enabled="true" />
                <state type="changesLoaded" enabled="false" />
            </states>
            <build-types enabled-for-all="false" enabled-for-subprojects="false">
                <build-type id="bt656500" />
            </build-types>
        </webhook>
    </webhooks>
</settings>

where build-type id fetched from the database table build_type_mapping

and in the UI it shows 0 in Enabled Builds image

This had been working fine until we've upgraded to teamcity 2023 to 2024, where we encountered https://github.com/tcplugins/tcWebHooks/issues/236 and upgraded to 2.0.1

Current Behavior

After applying the plugin without server restart, no hooks worked. After restarting the server, hooks for some builds seem to be working, then it stopped working. The event response for those are all

701 :: WebHook not enabled for buildState 'buildFinished'

This time another server restart does not resolve it. Again most but not all are broken, which is baffling. This is an example plugin-settings.xml of a build that does work.

<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <webhooks enabled="true">
        <webhook url="http://<same endpoint as above" enabled="true" format="json" template="none">
            <states>
                <state type="buildInterrupted" enabled="true" />
                <state type="buildFailed" enabled="true" />
                <state type="buildBroken" enabled="false" />
                <state type="responsibilityChanged" enabled="false" />
                <state type="buildSuccessful" enabled="true" />
                <state type="buildStarted" enabled="false" />
                <state type="beforeBuildFinish" enabled="false" />
                <state type="buildFixed" enabled="false" />
                <state type="buildFinished" enabled="true" />
                <state type="changesLoaded" enabled="false" />
            </states>
            <build-types enabled-for-all="false" enabled-for-subprojects="false">
                <build-type id="bt656397" />
            </build-types>
        </webhook>
    </webhooks>
</settings>

I did a diff of the xmls and they are the same (except for build-type id). I have triple verified that the build-type ids are the correct ones in the db.

This one in the UI, Enabled Builds shows 1

image

The hooks configured via UI at the top level seem to be working fine.

From configuring a test hook in the UI, I noticed the format of the plugin-settings.xml has changed - also includes an id. Are new versions that fixes https://github.com/tcplugins/tcWebHooks/issues/236 backwards compatible?

I have tried turning on debug but didn't see much in the logs as to why "WebHook not enabled for buildState 'buildFinished'"

Expected Behavior

Hooks sent with 200 status

Steps to Reproduce (for bugs)

  1. Configure a kotlin project with above xml
  2. Launch build
  3. See if webhook is sent

Your Environment

Appreciate any help here, as hooks is pretty broken for our instance. Thank you

radiumsurfer commented 1 month ago

I'll also add that since upgrading, in the webhooks tab of the project where specific build was configured, there are 701 errors appears for ALL of the jobs in that project, even though this section only has one job configured

            <build-types enabled-for-all="false" enabled-for-subprojects="false">
                <build-type id="bt656397" />
            </build-types>
netwolfuk commented 1 month ago

Thanks for for the detailed information. I had a look at #246 on the weekend, but could not reproduce it. Hopefully this will help me pin it down, until I get more information around that ticket.

netwolfuk commented 1 month ago

I tried to fix a bunch of issues around reloading webhook config from files in version 2.0.1

If you say it previously worked, then version 2.0.0 might be more reliable for you. I'm not sure if it's practical for you to just try stuff though.

Hopefully once I can reproduce this, I can check if 2.0.0 has the same issue. That would at least help me narrow it down a bit.

radiumsurfer commented 1 month ago

Thank you for the prompt reponse @netwolfuk. The previous version of tcWebhooks we ran with the older TeamCity version 2023.11.3 is v2.0.0-rc.3. But obviously that didn't work with TeamCity 2024 when we upgraded so I bumped the plugin to 2.0.1 (502). We'd have to schedule downtime for a quick server restart for trying new plugins (I assume this plugin requires a restart? bumping to 502 without restart seemed problematic - hooks didn't work, hooks tab in UI seem to show a new entry of a configured hook every time our versioned settings loaded - and they would show an error about webhook id not found - a server restart seem to clear it but I think it's still happening (see screenshot - there's only one configured hook for that project)).

image image

I should be able to try 2.0.0 this weekend. By version 2.0.0 did you mean build 495, not the rc* releases?

netwolfuk commented 1 month ago

Thanks for the extra information. Yes, if you could please try build 495 I would be very grateful. I realise this is a pain and I'm so sorry.

There is an issue with TeamCity hot reloading this plugin. I think it's because I use the plugin dependency features of TeamCity. A TeamCity restart will be required after deleting the existing plugin zip files and dropping the 495 versions in.

radiumsurfer commented 1 month ago

No worries. Really appreciate the prompt responses. I'll install build 495 on Sunday morning and post an update on Monday to see how it goes.

netwolfuk commented 1 month ago

I have spent some more time on this this weekend, and am still not able to reproduce this issue :-(

I only just enabled Kotlin VCS. Maybe it needs to be more complex than I have configured.

I wonder if it would be possible to get a copy of the .teamcity folder in your project (eg, from Git). You would want to remove any sensitive information from the files. I suspect the VCS root (if there) and the URLs for the webhooks at a minimum.

radiumsurfer commented 1 month ago

it won't be practical to send you all of .teamcity as we have several thousand projects all defined in there in kotlin.

I rolled to build 495 this morning, and from a glance we will likely have the same issue.

On a build that seem to be working, I see in the UI

image

I'm not sure if intended, but if I click on Edit build webhooks for the build itself i.e. "Empty PR Check", the page does not show any hooks configured

image

However if I click on its project "PR DevOps Jobs", it does show

image

and editing the webhook itself shows that it has correctly identified the build specified

image

This is the plugin-settings.xml for that

.teamcity/pluginData/<Project name>PRDevOps/plugin-settings.xml 
<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <webhooks enabled="true">
        <webhook url="http://<domain>.com:8080/route/teamcity" enabled="true" format="json" template="none">
            <states>
                <state type="buildInterrupted" enabled="true" />
                <state type="buildFailed" enabled="true" />
                <state type="buildBroken" enabled="false" />
                <state type="responsibilityChanged" enabled="false" />
                <state type="buildSuccessful" enabled="true" />
                <state type="buildStarted" enabled="false" />
                <state type="beforeBuildFinish" enabled="false" />
                <state type="buildFixed" enabled="false" />
                <state type="buildFinished" enabled="true" />
                <state type="changesLoaded" enabled="false" />
            </states>
            <build-types enabled-for-all="false" enabled-for-subprojects="false">
                <build-type id="bt656397" />
            </build-types>
        </webhook>
    </webhooks>
</settings>

However, on a build that doesn't work:

image

see that there's no entry in the build section, and clicking Edit project webhooks leads to a similar page where it shows 0 builds in Enabled builds column,

image

and similar editing the webhook itself shows that no builds are selected.

The diff between the two plugin-settings.xml for these two:

$ diff .teamcity/pluginData/<project>PRDevOps/plugin-settings.xml .teamcity/pluginData/<project>NextRsrBuild/plugin-settings.xml 
18c18
<                 <build-type id="bt656397" />
---
>                 <build-type id="bt656500" />

I get the build id from teamcity's db directly

select int_id, ext_id from build_type_mapping where ext_id='${1}'
bt656397 | <project>Pipeline_EmptyPRMaster
bt656500 | <project>Pipeline_NextRsrBuildComposite

Note that none of our webhooks configuration changed and the previous changes were all additions - the format has been the same since we've started using kotlin a few years ago.

I tried looking for errors in teamcity-server.log but the only thing I can find searching for webhook is a line like this at the beginning of starting the server

[2024-07-28 07:53:55,033]   WARN -   jetbrains.buildServer.SERVER - Unknown tag in plugin-settings.xml for project with internal id 'project100669' (will ignore): webhooks

this line is repeated for what appears like a few hundred project ids. We only have 40 some plugin-settings.xml configured, most with 1 or 2 builds, so I'm not sure if it's relevant.

So, trying to summarize. It looks like if the hook is able to identify/index the build we're interested in from plugin-settings.xml, the hook works. (Despite some 701 :: WebHook not enabled for buildState 'buildFinished' events generated for other builds that we're not interested in the same project). However, if in the ui it looks like the hook wasn't able to match to the build, it does not work.

Interestingly on our staging instance, which I've previously exported/imported the prod db to where the build ids remained the same, both the example builds is actually correctly identified and enabled. However, another build in another project doesn't work. (using webhooks version 502). I did a tally, it seems about 15% of the hooks are working on both servers.

Could it be resourcing/caching? I'm not sure if it's practical for you to configure say a few dozen hooks to try out.

Is there some debug logging I can turn on for just webhooks?

netwolfuk commented 1 month ago

if I click on Edit build webhooks for the build itself i.e. "Empty PR Check", the page does not show any hooks configured

This is a bug I was not aware of. I can reproduce this bug. I'll look at fixing that, but arguably not as important as the other issue.

netwolfuk commented 1 month ago

Another view that might be helpful is the search view. /webhooks/search.html?show=all You can see all the webhooks that that plugin has in its cache.

My gut feel is that this is a caching issue. I maintain a cache of the webhooks, so that I can quickly build the UI and quickly find webhooks from parent projects by walking the tree.

I will read and re-read your message as it's loaded with lots of useful information and will draw myself a picture off your config.

Thank you so much for all the work you're doing on this. I was upset to read that only 15% of your webhooks are working, and will keep digging into it.

I'm not sure if it's practical for you to configure say a few dozen hooks to try out.

Yes, I will do that.

Is there some debug logging I can turn on for just webhooks?

The logging is to the SERVER log4j tag. This means turning on debug all will give you a lot of information about everything including tcWebHooks. So that might not be practical for you.

radiumsurfer commented 1 month ago

I'm getting this error when looking at webhooks/search.html?show=all

image

The webhooks admin page shows

image

but the number doesn't look right we shouldn't have 100s of hooks.

does the cache clear upon server restart?

the endpoint does work on our staging instance though.

image image

not sure why the discrepancy here is? 70 vs 42?

image

it does provide a good idea/view of what hooks are working (1 build in enabled column) vs not (0 build).

and just fyi, we're running a multi-node setup - teamcity designates a primary node that should be for writing changes, but while sharing teamcity's data dir with all nodes. I'm not sure if it would impact webhooks plugin, but we've had other plugins misbehave until it was updated to be disabled on secondary nodes.

netwolfuk commented 1 month ago

I'm getting this error when looking at webhooks/search.html?show=all

That data is built from the configs. I can't think of a reason that it would ever be null. Unless the project is deleted after the server has started up, but that seems unlikely that you'd do that.

but the number doesn't look right we shouldn't have 100s of hooks. does the cache clear upon server restart?

Yes, the cache is just in memory. It can't persist across restarts.

not sure why the discrepancy here is? 70 vs 42?

Oh, that is very strange. I'll double check where that info comes from. It should be the same place, but those pages were written about 10 years apart. so maybe I am not consistent.

it does provide a good idea/view of what hooks are working (1 build in enabled column) vs not (0 build).

Do you think it's the same 15% of webhooks working across the prod and staging servers?

and just fyi, we're running a multi-node setup

oh! I hadn't thought of that. I've never tested on a multi-node setup. I'll try that. I might have to reach out to the guys at TeamCity and get a license for that.

Are you running build 495 in staging? I must admit I was really hoping it would resolve itself with the older version, but I wonder if it has written all these extra webhooks into your config, and now whatever version we run will just try to re-create them.

I did notice your webhook has no id defined. I suspect this webhook has not been edited in the UI for many years.

<webhook url="http://<domain>.com:8080/route/teamcity" enabled="true" format="json" template="none">

Is there any webhooks that do have ids? Perhaps those the are the ones that are working?

<webhook id="id_552375210" url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=1" enabled="true" template="velocity_escaping" hide-secure-values="true">

I will need to test what the code does when no ID is defined. For me it appears to be re-written by the VCS module. I'm not sure why that is not happening for you.

netwolfuk commented 1 month ago

I just thought maybe you don't allow TeamCity to write back to VCS?

Can you tell me what these settings are set to? Specifically the Allow editing project settings via UI and associated values.

image

netwolfuk commented 1 month ago

I think I have been able to reproduce this.

This is really more of a brain dump than anything at this stage.

  1. Creating webhooks from the UI with sync to VCS turned on creates ok.
  2. Somehow then the build specific block <build-type id="bt10" /> gets dropped on subsequent VCS edits

Also. With VCS sync turned off, creating webhooks creates these "ghost" webhooks. Some of which are editable. Some are not. WebHooks in the UI show with builds enabled, even though the config does not have any configured. It appears to inherit the most recent setting. I was wrong here.

In this screenshot, number 4 is in there twice. The first number 4 cannot be edited in the UI (loads empty and errors),

image

<?xml version="1.0" encoding="UTF-8"?>
<settings>
  <webhooks enabled="true">
    <webhook id="id_552375210" url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=1" enabled="true" template="velocity_escaping" hide-secure-values="true">
      <states>
...removed for brevity...
      </states>
      <build-types enabled-for-all="false" enabled-for-subprojects="false">
        <build-type id="bt10" />
      </build-types>
    </webhook>
    <webhook id="id_971147615" url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=2" enabled="true" template="legacy-json" hide-secure-values="true">
      <states>
...removed for brevity...
      </states>
      <build-types enabled-for-all="false" enabled-for-subprojects="false" />
    </webhook>
    <webhook id="id_678677915" url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=3" enabled="true" template="legacy-json" hide-secure-values="true">
      <states>
...removed for brevity...
      </states>
      <build-types enabled-for-all="false" enabled-for-subprojects="false" />
    </webhook>
    <webhook url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=4" enabled="true" template="legacy-json" hide-secure-values="true">
      <states>
...removed for brevity...
      </states>
      <build-types enabled-for-all="false" enabled-for-subprojects="false">
        <build-type id="bt10" />
      </build-types>
    </webhook>    
    <webhook url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=5" enabled="true" template="legacy-json" hide-secure-values="true">
      <states>
...removed for brevity...
      </states>
      <build-types enabled-for-all="false" enabled-for-subprojects="false">
        <build-type id="bt10" />
      </build-types>
    </webhook>    
  </webhooks>
</settings>

At least it feels like I am making progress.

netwolfuk commented 1 month ago

Also, webhooks/search.html?show=all only shows one. If I remember correctly, that is the only webhook created via the UI.

image

radiumsurfer commented 1 month ago

sry I forgot to mention, we don't allow writebacks to VCS (using 3rd option - all changes made via code - no UI changes allowed) - teamcity disables all build configuration edits. I haven't attempted to make any webhooks changes from the UI - since webhook configs are made directly via the plugin-settings.xml (in vcs), I'm not sure how it would work if you did allow changes from the UI.

image

That data is built from the configs. I can't think of a reason that it would ever be null. Unless the project is deleted after the server has started up, but that seems unlikely that you'd do that.

I definitely haven't deleted anything manually. I've configured our staging and prod as similar as possible with the difference being that the staging builds are only run manually, and prod has sandboxes (see structure at the end of this message). I've restarted both instances numerous times so based on that there shouldn't be anything in memory cache. I'm wondering if there's anything from TeamCity's data dir cache that can mess this up. Search in .BuildServer/system/caches for plugin-settings.xml turned up empty.

Do you think it's the same 15% of webhooks working across the prod and staging servers?

I can't tell what % of hooks are working on prod because the webhooks/search.html?show=all page crashes. On the staging instance, the page shows 13/38 hooks (from kotlin versioned settings) is enabled (1 build in enabled column, so that's around 34% working on the staging instance. From manually looking on the production instance, there's some builds that are working that are same as the staging, but it's not a complete overlap. Some are enabled on prod but not on stage, and vice versa. Example of this in message above.

Are you running build 495 in staging?

No it's 502 on staging.

I did notice your webhook has no id defined. I suspect this webhook has not been edited in the UI for many years.

Correct, this webhook format has been carried over from an older webhooks configuration in the UI. Ideally there's a kotlin module that allows us to configure the hooks from kotlin so we don't have to mess with xmls, but I'm guessing that's probably a version 3.0 kind of change for webhooks.

In this screenshot, number 4 is in there twice. The first number 4 cannot be edited in the UI (loads empty and errors),

An anecdote that I noticed that if I install and enable the plugin without a server restart, I would see multiple webhook entries in the webhooks tab view, from what appears like each time someone merged a change (unrelated) which causes the versioned settings to load again I would see a extra dup entry, and there would be another one upon a new sha and versioned settings loaded again. If I recall correctly, only one of them has the Enabled build = 1.

I'm also not sure if related, sometimes if I edit a hook that has Enabled build = 1. In this case, for the working one above "PR DevOps Jobs", I see this

image

but it doesn't always happen. This worked yesterday after I restarted the server, so I'm not sure if it gets into this state after versioned settings is loaded again.

Is there any webhooks that do have ids? Perhaps those the are the ones that are working?

Not in our kotlin codebase. I just checked we have 1 hook at the top level (outside of versioned settings) that's configured to send a hook for every single build, that one was migrated from the old instance as well and it also does not have an id. I did notice the new version produces an id when I played with it on the UI via a separate project, but I'm not sure how it impacts how webhooks works, or what value I should provide if I was provisioning them via xml.

Our versioned settings repo structure is something like this

.teamcity/ src/ main/ com/etc/etc Main.kt defines initializeSeedProject() test/ settings.kts pluginData/ <project 1 name id>/plugin-settings.xml <project 2 name id>/plugin-settings.xml <project 3 name id>/plugin-settings.xml

content of settings.kts

version = "2024.03"
initializeSeedProject()

initializeSeedProject is something like

import jetbrains.buildServer.configs.kotlin.BuildType
import jetbrains.buildServer.configs.kotlin.project
import jetbrains.buildServer.configs.kotlin.Project

internal fun initializeSeedProject() {
  project(Project() {
    name = "Test project"
    build(BuildType() { name = "Test Build" })
  })
}

Our teamcity instance project structure is this

Root project/ Product/ <- there is a hook defined at this level that sends all hooks - this one appears to be working correctly Pipelines [Kotlin]/ <- main production project. everything under here is code-based, all changes done via code, UI editing not allowed Pipelines [Sandbox]/ User1/ <- users have the ability to render a subproject (managed via kotlin code) of the main pipeline to test changes etc. Due to the project structure, I believe every sandbox also picks up the pluginsData folder. However the hooks would not work because the build ids defined in plugin-settings.xml are all of the "prod" build, so they would not match any build types in sandboxes. I'm not sure if this would affect how webhooks handles them because the hooks do not have ids. Now that I've mentioned this, this is probably the reason where those 630 hooks are from on our prod instance. It is counting those in the sandboxes, even though they don't match to any builds within it. Staging does not have sandboxes. Could this be the reason the view all page is crashing? User2/

It does look like we were using tcWebHooks-1.1.374.403 with teamcity 2023 (yes sry 4 years guilty. just kept what has been working :)

netwolfuk commented 1 month ago

Thanks @radiumsurfer This explains a lot.

It does look like we were using tcWebHooks-1.1.374.403

I would have to look through the code, but I suspect this version did not use the in-memory cache at all. The webhooks are stored in the cache for quick lookup via the id. When a webhook is loaded with no id, one is autogenerated. When webhooks are written back to the plugin-settings.xml this id is written too.

So, the scenario we have is that:

  1. Every time the project settings are reloaded from VCS, a new webhook is being created with a new id.
  2. These ids are never sync'd back to the VCS since that is disabled.
  3. There must be a bug where old webhooks with previous IDs are not removed from the cache index correctly.

I suspect the old webhooks are removed from part of the cache, which is why they are not found when we try to edit them (and don't appear to be counted in some scenarios).

I would recommend the first thing we do is set IDs on your webhooks. I have just tested setting it in the VCS and they can be any string (as long as it works in XML and HTML) but they must be unique across the teamcity instance. If they clash, they will be auto-regenerated and we will be back in the same position again where the ID is out of sync with the settings in VCS.

I would therefore recommend an id based on the project id. Eg, if your Project Id is FooBarPRDevOps you could give the webhooks ids in the file like this: <webhook id="FooBarPRDevOps_Webhook_1" url=...

Bear in mind that this is an ID change, so you will get one more ghost webhook until the server is rebooted. Here is my example one....

    <webhook id="VcsSettingsProject_Webhook_1" url="http://localhost:8111/webhooks/endpoint.html?vcs_webhook=6" enabled="true" template="velocity_escaping" hide-secure-values="true">

Once webhooks have a consistent id defined, the ghost webhooks should resolve themselves on the next TeamCity restart. Technically, they shouldn't be happening now, but I will need to find that issue and release a new version to fix it.

Ideally there's a kotlin module that allows us to configure the hooks from kotlin so we don't have to mess with xmls, but I'm guessing that's probably a version 3.0 kind of change for webhooks.

Yeah, this is quite a significant change, as it requires rewriting all the internals for how webhooks are stored. I'm also not sure how the settings would be written for non-vcs projects, and I'd need to figure out how to support people when I can't just say.... "what does plugin-settings.xml look like?

Now that I've mentioned this, this is probably the reason where those 630 hooks are from on our prod instance. It is counting those in the sandboxes, even though they don't match to any builds within it. Staging does not have sandboxes. Could this be the reason the view all page is crashing?

I'm going read this a few times to see if I can figure out whether this would be affecting it. This also feels like a bug, but I struggling with the mental model, and will need to draw some pictures :-)

Thanks again for your perseverance and detailed reporting. It feels like we're getting there slowly.

radiumsurfer commented 1 month ago

Oops I just notice the spacing is trimmed in the project structure formatting let me try that again as code block (with some updates)

Our teamcity instance project structure is this

Root project/ <- versioned settings enabled here to a different repo as a backup, using xml 
    Product/
        Pipelines [Kotlin]/
            Subproject 1 (org/main)
            Subproject 2 (org/main)
        Pipelines [Sandbox]/
            User1/
                Subproject 1 (fork/branch)
            User2/
                Subproject 2 (fork/branch)

Pipelines [Kotlin]/ <- main production project. everything under here is code-based, all changes done via code, UI editing not allowed. However there is a hook defined at this level that sends all hooks (provisioned from the webhooks UI tab) - this one appears to be working correctly, however, looking at the versioned settings xml on the server, it doesn't have the id either - this was provisioned with a previous webhooks version - but I guess the new version didn't write it back with an id.

Essentially each sandbox user is a separate versioned settings enabled project looking at a fork/branch of the production versioned settings repo. We have code to narrow down the number of projects being rendered since everyone is working on a different piece. However since the pluginsData dir isn't kotlin, it is taken in and rendered for every sandbox. The numbers don't quite add up though. our entire pluginsData has ~40 hooks, but have way more than 15 sandboxes which would ~600 hooks...

In any case I will add ids to the hooks and try it on staging server to see what improves.

Also wondering if you're aware or know why I would see this entry in the logs, or if it's any concern.

[2024-07-28 07:53:55,033]   WARN -   jetbrains.buildServer.SERVER - Unknown tag in plugin-settings.xml for project with internal id 'project100669' (will ignore): webhooks

I've also noticed provisioning a hook with the new webhooks version, there's a few differences.

4c4
<         <webhook url="http://co.com:8080/route/teamcity" enabled="true" format="json" template="none">
---
>         <webhook id="id_417396340" url="http://co.com:8080/route/teamcity" enabled="true" template="legacy-json" hide-secure-values="false">
5a6,13
>                 <state type="buildAddedToQueue" enabled="false" />
>                 <state type="buildRemovedFromQueue" enabled="false" />
>                 <state type="buildPinned" enabled="false" />
>                 <state type="buildUnpinned" enabled="false" />
>                 <state type="testsMuted" enabled="false" />
>                 <state type="testsUnmuted" enabled="false" />
>                 <state type="reportStatistics" enabled="false" />
>                 <state type="serviceMessageReceived" enabled="false" />

in the webhook object, should I also update the template, add hide-secure-values, and remove format attributes? What about the new state objects? add them?

netwolfuk commented 1 month ago

Also wondering if you're aware or know why I would see this entry in the logs, or if it's any concern.

That's not my code printing that. Are you able to track down that file and post it? From what I can tell, it will be ignoring any webhooks from that (eg, missing in TeamCity).

in the webhook object, should I also update the template, add hide-secure-values, and remove format attributes? What about the new state objects? add them?

It will default to sane values. I think id really is the only one we need. format can be removed. It is ignored because format is now an attribute of template, not the webhook.

Thanks for the extra layout. That will help me to reason it out. I'll setup some nested projects to try to reproduce it.

netwolfuk commented 1 month ago

However there is a hook defined at this level that sends all hooks (provisioned from the webhooks UI tab) - this one appears to be working correctly, however, looking at the versioned settings xml on the server, it doesn't have the id either - this was provisioned with a previous webhooks version - but I guess the new version didn't write it back with an id.

It's interesting that this one doesn't ghost. Maybe you don't tend to make VCS changes often that affect this project?

radiumsurfer commented 1 month ago

That's not my code printing that. Are you able to track down that file and post it? From what I can tell, it will be ignoring any webhooks from that (eg, missing in TeamCity).

All our plugin-settings.xml like the ones I posted above has the webhooks tag in them. Is it not correct/needed anymore?

<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <webhooks enabled="true">
        <webhook url="http://<domain>.com:8080/route/teamcity" enabled="true" format="json" template="none">
            <states>

It's interesting that this one doesn't ghost. Maybe you don't tend to make VCS changes often that affect this project?

Hmm my bad about statement saying it's provisioned in the UI. It looks like this hook is also defined in our kotlin repo with the pluginData/_Self folder, which seem to apply to the root project that has the versioned settings enabled (where all its subprojects are rendered via kotlin). The project itself was created on the UI, but the code could define objects in current project as well (kotlin code adds vcs roots to it etc).

That explains the lack of ID. But this hook seem to behave correctly (and showing correctly on the project pages), until I click edit the hook, where I'm met with the A WebHook with that ID was not found in this project. and greyed out selections on the trigger events.

image

I will add IDs to all of them and report back. Thanks for the tips

netwolfuk commented 1 month ago

All our plugin-settings.xml like the ones I posted above has the webhooks tag in them. Is it not correct/needed anymore?

That's still required. Those look correct other than the missing id and the superfluous format.

radiumsurfer commented 1 month ago

So i've added ids to all the hooks, and loaded it on our staging instance. I think I'm no longer getting the A WebHook with that ID was not found in this project message when I edit a hook. However, I still have 5/34 hooks that seem to have a matching build, while 29/34 does not.

image

netwolfuk commented 1 month ago

I need to dig into that page some more. I have found it to be unreliable after loading WebHooks from VCS config. Mine also says 0 builds, and is even completely missing a sub-project that is maintained by VCS.

I need to check if the code to find webhooks for execution uses the same algorithm as this page. If it does then those webhooks won't be being executed either. D'oh!

image

Thanks so much for all the detailed information. I have a lot to sift through. Please do keep sending anything that might be relevant. It all helps piece together the puzzle.

netwolfuk commented 1 month ago

I've managed to make some really good progress on this. I have found the following:

Updates to webhooks that are driven by the plugin work well The cache is maintained correctly because the changes are applied to the settings, and then the cache is updated.

Updates driven by VCS changes without IDs cause ghost webhooks This is because the existing webhooks are not updated because they are not found due to the ID being regenerated.

Also, I was not comparing the result of the existing webhooks against the updated webhooks. The only predictable way to compare them is by ID, but newly loaded webhooks will be assigned an ID if they don't have one defined. This will not match the existing one, so the old webhook is not updated. A new one is created.

I have added logic to compare the IDs after the cache is refreshed, and remove any webhooks from the current project if they don't match one of the new IDs. A line like the following is printed in the log.

[2024-08-05 09:16:35,341] INFO - jetbrains.buildServer.SERVER - WebHookSettingsManagerImpl :: Found the following webhooks in the cache that no-longer map to webhooks in project 'VcsProjectTest'. WebHooks with ids '[id_377052579, id_283758663]' will be removed from the cache.

This does have the downside that IDs will churn. This causes two annoyances.

  1. The history page will link webhook events with webhooks that might no longer be relevant.
  2. If the edit webhooks page is loaded, and then the webhooks change, the edit will fail because the webhook cannot be found.

Ideally, webhooks should have unique IDs. This will work around this issue.

TeamCity's buildTypeUnregistered event is poorly named In 2.0.1, I added a listener for the buildTypeUnregistered event. This allows me to cleanup webhooks that might be interested in a build that has been deleted.

However, I have found that this event is fired when the VCS configuration is reloaded. TeamCity appears to re-read the VCS settings, delete all builds on that project, and then recreate them with the same internal and external IDs. At which point the buildTypeUnregistered fires for each build.

When I see that event, I diligently clean up any webhooks that explicitly reference that build, and remove that build (btxxxx) id from their configuration.

This presents as webhooks having 0 builds registered. Sometimes webhooks still have builds showing. This is because a project edit (either via VCS or from the UI) fires up to 10 events in quick succession. I have a queue backed delay and de-bounce algorithm to only present one of each type of event.

I therefore only see one buildTypeUnregistered and one projectRestored, but they will fire at almost the same time. At which point a race condition exists as to whether the cache is refreshed while the build is marked as deleted or has been "undeleted".

I have removed the buildTypeUnregistered event listener, and builds now reliably show for all webhooks. I will do some further testing to make sure that not listening for this event has no problematic side-effects.

I'll do more testing on various configurations. I hope to have a very early beta release of 2.0.2 available for testing in 24 hours.

netwolfuk commented 1 month ago

FYI @matt-richardson

netwolfuk commented 1 month ago

BTW. It's possible to get a bt ID from the REST API. http://teamcity:8111/app/rest/buildTypes/TcPlugins_TcWebHooksLocal?fields=id,internalId

<buildType id="TcPlugins_TcWebHooksLocal" internalId="bt9"/>
netwolfuk commented 1 month ago

Builds on this branch are available to download on the jetbrains TeamCity instance. You don't need an account to download. Just choose guest mode.

Would it be possible to install this on your staging server and test it? I am pretty confident this will fix the issues you are seeing. If you do see further issues, I would appreciate any information you could share.

Thanks again for all your help thus far.

radiumsurfer commented 4 weeks ago

hi @netwolfuk sry I been busy with some other urgent tasks. Just got around to loading this on our staging instance, and restarted the server, and at first look from the /webhooks/search.html?show=all page, it still looks like 5/34 are "identified" while 29 show 0 build matched. This appears to be the same as what I saw with the previous webhooks build on this server which is 495.

image image

I double checked with the internalId, and it indeed still matches what we had in the configuration file, which I've updated with IDs. (Thanks for the Rest API call. It's much more convenient than going into the db)

btw, I think the project links in the edit webhooks page is broken (at least to me). Does it work for you?

image image
netwolfuk commented 4 weeks ago

hi @radiumsurfer . Thanks for coming back to me with the updates. It looks like I still have some work to do to reproduce the issue with the builds not being found.

I thought I had found it when I found those webhooks being updated on the build Delete event.

btw, I think the project links in the edit webhooks page is broken (at least to me). Does it work for you?

Yes you're right! If you have the tcWeHooks-REST plugin install (which you do) and edit the webhooks from the "Webhooks and Templates" tab inside the project edit page, then the links are correct, but from the original edit page (which you have screenshotted), the links are wrong. I will fix that. Thanks for raising it!

I will make some changes to the logger so that we can log debug events just for the tcWebHooks. I think I need to see exactly which debug logging you are getting when the project is reloaded from VCS, so that I can reproduce what you're getting.

My unit testing was great, but when I actually looked really hard at the logs on my server, is when I found that the delete event was messing with the existent webhooks.

netwolfuk commented 3 weeks ago

Hi @radiumsurfer I have pushed some logging updates which log all tcWebHooks events into a specific logging package. This means that I can now separate out the tcWebHooks log events into a different log level and a different file.

The RC2 version should still log to teamcity-server.log as before, but it's now possible to enable debug logging just for the plugin.

Assuming you are running normal logging in TeamCity, I have uploaded a debug-tcwebhooks.xml which enables debug for tcWebHooks and logs to both teamcity-server.log and teamcity-tcwebhooks.log files.

Adding a new log configuration is fairly simple. Download the debug-tcwebhooks.xml file and then go to /admin/admin.html?item=diagnostics&tab=log4j in teamcity. At the bottom is a button labelled Upload new file and you can upload the logging configuration.

image

Then on the main diagnostics tab, you can select the uploaded logging... image

This will produce quite a lot of logging, and you will probably want to redact some of the entries. I'd really like see the log output from when it reloads the settings from the VCS. If you'd prefer, you're welcome to email the redacted log straight to me. My username here is the same as on gmail.

netwolfuk commented 3 weeks ago

btw, I think the project links in the edit webhooks page is broken (at least to me). Does it work for you?

FYI, I just pushed an update that fixes these broken links. Thanks for reporting it.

radiumsurfer commented 3 weeks ago

hi @netwolfuk, well I'm really not sure what's going on anymore 😕

so here's what I did on our staging instance:

since I last loaded the webhooks, I had to use the staging instance for another task to experiment with our multi-node setup. I failed over to a secondary node and designated that as the main node. I did not realize I hadn't set it back to the former primary node (let's say physical server 1 - not sure if this is significant for anything, but thought I'd jot down the detail so the rest of my actions makes sense)

I applied the debug log template, the debug-tcwebhooks option shows, and I select that. The UI still performs normally (just being processed by physical server 2). The only difference might be that the logging is specific to the node, and for plugins that only operate on the primary node, logs would only be on the primary node, which physically is now the former secondary node (physical server 1). In any case, I was observing on the former primary node (physical server 1), and did not see teamcity-tcwebhooks.log, so I thought maybe since we're not running any builds here, it's not logging - I tried reloading versioned settings - this is the part we want logging on - how the hooks are loaded - this is when I encounter an error. Teamcity is complaining that it is unable to find a previous revision or something. This isn't related to webhooks. We've seen this the odd time here and there before but I don't quite understand this and no amount of reload fixes it. The only way to fix it is to delete the project and recreate it.

So I delete and recreate the entire versioned settings project - still no logs - no new logs related to webhooks in teamcity-server.log either. At this point I thought it might need a restart, so I restarted teamcity server on physical server 1. After the server came back and the logs show that plugins didn't load, I realize I was on a secondary node, so I went to look on server 2 (current main tc node)... but didn't find the tcwebhooks.log there either. One thing I did notice was that the administration page on the UI was missing the WebHooks links - I think since the plugin doesn't work with multi-node, the new failed over (primary) node would probably need a restart for it to show up)...

I failed over the primary node back to physical server 1 and continue to look for those logs. Nothing. I restart teamcity on server 1 again, nothing.

Since the plugin loaded and I now can see the WebHooks link in the admin page, I went in and to my surprise, all hooks seem to be working - all the hooks with specific builds have the build ids identified and attached.

Just to test things out - I pointed versioned settings sha to a revision that did not have my additions of the IDs in the plugin-settings.xml - they still seem to have loaded fine - no complaints about Ids even if I went into to edit the hooks, even after I reloaded versioned settings again - still seem to load all hooks now.

So, between not realizing that we had a different primary node, and the versioned settings getting out of whack that I needed to delete and recreate the project, I'm not really sure what "fixed" it.

If it's caching in the server memory, I certainly have restarted the old primary node many times and we still saw the problem. If it's caching on the server disk, I would expect the problem to come back after we failover back to physical server 1. Perhaps the most plausible is caching on disk, either local or share dir, but has some association with the project, such that it was cleared when I deleted the project.

I'm thinking this leaves me with a narrow path forward - If you're not able to find a place/fix in the plugin where say you can control/influence whatever cache there is, I think I will start with trying to do a failover/failback during the next server maintenance window for our prod instance. If that doesn't fix it, I will have to tread carefully about removing the entire project as we don't want to lose build history. On the staging instance, despite deleting the project, recreating it with the same ID right away seem to retain build history or teamcity made a smart move to reattach them. Hope this is consistent, but I will need to check with jetbrains to get some kind of reassurance that we will not lose build history before proceeding with this.

As for which version of the plugins-settings.xml to use, I think I'll try it with what we have now (no IDs), to see if it would still work (just to keep variables consistent in the interest of finding out what fixes the issue). If it does (or doesn't) I'll still update it.

What do you think?

radiumsurfer commented 3 weeks ago

Regarding why the logging preset doesnt work I might add that our servers are airgapped/does not have internet access and I noticed some external links in the preset file so I'm not sure if that could be the reason..

netwolfuk commented 3 weeks ago

Wow, what an exciting weekend!

As for which version of the plugins-settings.xml to use, I think I'll try it with what we have now (no IDs), to see if it would still work (just to keep variables consistent in the interest of finding out what fixes the issue). If it does (or doesn't) I'll still update it.

My gut feel is that the code will keep recreating all the hook with new IDs if there are none defined. This is ok if you're happy with the IDs changing, and not matching with the ones in the history page.

If at all possible, I would recommend having the IDs in the plugin-settings.xml file. This will stop that, but I understand your concern that you would rather go back to a known state first.

Hope this is consistent, but I will need to check with jetbrains to get some kind of reassurance that we will not lose build history before proceeding with this.

Yes, I would definitely reach out to their support team and check that removing and re-creating will keep all the build and audit history. That would be a huge blow if that didn't line up.

Regarding why the logging preset doesnt work I might add that our servers are airgapped/does not have internet access and I noticed some external links in the preset file so I'm not sure if that could be the reason..

Thank you for pointing that out. I will look more closely at it. I simply copied the one from my server and added the half dozen new lines to enable the tcWebHooks logging.

In other news, I have built up two new test VMs, and requested a full enterprise licence so that I can run a full primary/secondary setup.

I am writing bash scripts which generate a full 5000 build configuration kts file, and will query the REST API to populate the bt numbers in the XML files. I am really feeling your pain here with maintaining the plugin settings file!! I have raised a request with Jetbrains to explain to me how I can support plugin-settings.xml and settings.kts

I have the following that I am building up. I have 10 projects configured but will run the loop until I have 1000. I am manually setting the bt numbers, but will automate that piece.

- _Root project (normal settings)

    VCS Test project
        Settings in Kotlin
        One webhook configured with sub projects enabled

        VCS Test Subproject 0001
            Settings stored in parent project settings.kts
            Three webhook configured all set to just one build

            Five Builds configured

repeated 999 more times

        VCS Test Subproject 1000
            Settings stored in parent project settings.kts
            Three webhook configured all set to just one build

            Five Builds configured

Is that an accurate approximation of your setup? Are all your projects in a single KTS file?

netwolfuk commented 3 weeks ago

BTW. I had a look at the code and the search page and the webhooks page call the same method, so those numbers should equate.

I suspect that the webhook cache was taking a long time to rebuild, possibly due to the discovery that I was recreating the webhooks for every build on every project when the settings.kts was refreshed. The latest two builds of tcWebHooks have fixed that.

netwolfuk commented 3 weeks ago

I'm intrigued that the duel node configuration is adding so much complexity. I'm looking forward to being able to test that configuration.

netwolfuk commented 3 weeks ago

I had another look through that logging configuration file. I can't see anything in there that would require a network connection.

The names with all the dots in them (jetbrains.buildServer) are references to how the java class files are grouped into packages in TeamCity. This allows the redirection of the logging from certain packages into certain files.

netwolfuk commented 2 weeks ago

I have scripted the creation of 300 projects, each containing 5 builds. I tried for 1000 projects, but the VCS was taking more than 120 second to compile the Kotlin code.

Each project has 3 webooks configured. Somewhat annoyingly, it seems to be working perfectly.

You can see in the following screenshot, that there are 300 projects (search at the bottom), and 902 webhooks. One for _Root, one for the VCS project parent, and then 3 for each of the 300 projects.

image

Hopefully by the end of Monday (German time) I will have an enterprise licence, and I can setup the primary/second node to test it.

netwolfuk commented 2 weeks ago

FYI, here is the script in case you can find it useful for generating/validating the plugin-settings.xml file. My projects are named with a number 0001 through 0300, so it's fairly easy for me to swap out the values.

#!/bin/bash

# NOTE: Requires "jq" to be installed to parse the JSON from the REST API and extract the intenalId.
# NOTE: Expects to read a template file called template-plugin-settings.xml from same dir as script.
# NOTE: Expects .teamcity dir (containing settings.kts) to be in same dir as script.

# Set your TeamCity server hostname. Expects the REST API to be at $TEAMCITY_BASE_URL/app/rest
TEAMCITY_BASE_URL="http://teamcity-test-main:8111"
# Create yourself a token in your profile in TeamCity
TEAMCITY_TOKEN="eyJ0eXAiOiAiV.....Q5LWWQcqvrdfGY0OWM3NzdmYzg5"

# let the script figure out where it's running so that it knows where to 
# find the template and the .teamcity dir inside the GIT repo.
MY_SCRIPT=`realpath $0`
PARENT_DIR=`dirname $MY_SCRIPT`

# Iterate over every number between 1 and 300.
for NUMBER in `seq 1 300` ; do 

    # convert "x" into "000x"
    PADDED_NUMBER=$(printf '%04d' $NUMBER)

    # location into which to create the plugin-settings.xml file.
    DEST_DIR=$PARENT_DIR/.teamcity/pluginData/VcsSubProject$PADDED_NUMBER 

    # Make the directory if required
    mkdir -p $DEST_DIR

    # Fetch the bt number for the build type with external id "VcsProjectTop_VcsSubProject000x_TcDummyDeb"
    BT_NUMBER=$(curl -s -H "Authorization: Bearer $TEAMCITY_TOKEN" \
        -H "Accept: application/json" \
        $TEAMCITY_BASE_URL/app/rest/buildTypes/VcsProjectTop_VcsSubProject${PADDED_NUMBER}_TcDummyDeb?fields=internalId | jq '.internalId')

    # Print the btnumber. This is actually quite useful to see progress.
    echo $BT_NUMBER

    # Read the plugin-settings.xml file, replace "0001" from the template with "000x" for our iteration,
    # and replace "bt1" with the value of $BT_NUMBER which we fetched from the REST API
    # and then write it to the plugin-settings.xml in the "000x" directory.
    cat $PARENT_DIR/template-plugin-settings.xml | sed "s/0001/$PADDED_NUMBER/g" | sed "s/\"bt1\"/$BT_NUMBER/g" > $DEST_DIR/plugin-settings.xml

done
radiumsurfer commented 2 weeks ago

The project setup is a good approximation of the top level setup. We'd have more depth of varying levels for subprojects. We have one settings.kts file like so

import com.company.product.seedProject.initializeSeedProject
import jetbrains.buildServer.configs.kotlin.version
version = "2024.03"
initializeSeedProject()

the rest are defined in a java-like package hiearchy of .kt and various script files etc. All kotlin code goes in kt files, but I think at this point after the failover/recreating project exercise that seem to fix everything, I'm suspecting that the size of the project might not be the problem at least in the direct form.

I thought it might be some caching teamcity accumulates about a project, but our prod instance sees a lot of build activity and versioned settings activities (reload/fresh when someone makes changes) that the staging doesn't - most of the time it points to a fork with some test changes, yet both instance experienced the issue. (the database from staging was imported from prod originally).

The two instances did go through the same "old webhooks plugin 1.1.374.403 and settings/syntax on version 2023.11.3 then upgraded teamcity to 2024.03.3 along with plugin 502". I don't know the interactions between the webhooks plugin and teamcity or what teamcity does with caching so maybe some old cached data was carried over?

I had another look through that logging configuration file. I can't see anything in there that would require a network connection.

My apologies. I saved the link directly thinking it was the file so it actually grabbed the github webpage 💀 I can only excuse myself with it was late (i'm in Canada/Eastern time).

really appreciate the effort to dup my setup, on an exciting weekend indeed :) I know it's anything but efficient to troubleshoot this remotely with just descriptions and screenshots without access to the system.

Since the issue is no longer reproducible on our staging instance, I will look into possibly rolling that instance to the previous version along with the plugin, with the logging preset, to see if I can reproduce/get more clues. I can also apply the preset to prod if that didn't go anywhere. Prod produces much more logging so I will schedule a maintenance window for that.

netwolfuk commented 2 weeks ago

import com.company.product.seedProject.initializeSeedProject

Ah, that seems a lot more manageable than what I had, which was 153,000 lines of kts code.

I thought it might be some caching teamcity accumulates about a project, but our prod instance sees a lot of build activity and versioned settings activities (reload/fresh when someone makes changes) that the staging doesn't

My concern is that we might fix it by reloading the project and then it creep back in again over time. Unless I really understand how it happened, I am wary that it will eventually get back into the same state again.

I can't express enough how thankful I am for all your help with this. This is taking up your weekend time (and sleep time by the sounds of it).

netwolfuk commented 2 weeks ago

I have an enterprise licence now. Can I check which services you run on which node(s)? How many secondary nodes do you run? Do you have the tcwebhooks plugins installed on the main only or the secondary(s) was well? Thanks.

radiumsurfer commented 2 weeks ago

we run 4 nodes. Teamcity is still very primary/secondary as opposed to load balanced nodes that are identical. Primary must still be the node that users interact with for “write” operations. Secondary node UIs can be used to view builds and configs but are almost never used except when primary is down as there’s nginx in front to direct traffic. There are a set of Tasks that can be assigned to other nodes in the Nodes configuration page. I try to distribute what can be distributed such as vcs polling, build ingest, triggers etc to all secondary nodes. I’m not sure how teamcity handles plugins but if you’re on the secondary node UI it tells you you can’t install/manage plugins from secondary nodes. Plugins are installed to the data dir which is shared among the nodes. For something like webhooks I assume it’s just dormant on secondary nodes as build info is accessed through primary. I’m also not sure if they allow you to ship compute to other nodes but it does sound like it could be beneficial if webhooks uses a lot of resources when there are a lot of builds. I do recall we initially had problems with url trigger plugin on secondary nodes as it didn’t work on them, but triggers are one of the tasks that can be assigned to other nodes.

Ps I am away from the office this and the following week so my work and response might be delayed.

netwolfuk commented 2 weeks ago

Thanks again for the detailed answer. As I read more about these nodes, their shared filesystem seem like something that could introduce the errors we are seeing, but I need to play around some more.

I'll have a look and see if I have spare memory to run 4 nodes. It seems like plugins are dormant on the secondary nodes unless they are specifically coded to support them, and some special setting is set to whitelist them.

I hope you have a great time away. I'll beaver away on what we have so far.

DrSharky commented 1 week ago

I've managed to make some really good progress on this. I have found the following:

Updates driven by VCS changes without IDs cause ghost webhooks This is because the existing webhooks are not updated because they are not found due to the ID being regenerated.

Also, I was not comparing the result of the existing webhooks against the updated webhooks. The only predictable way to compare them is by ID, but newly loaded webhooks will be assigned an ID if they don't have one defined. This will not match the existing one, so the old webhook is not updated. A new one is created.

I have added logic to compare the IDs after the cache is refreshed, and remove any webhooks from the current project if they don't match one of the new IDs. A line like the following is printed in the log.

[2024-08-05 09:16:35,341] INFO - jetbrains.buildServer.SERVER - WebHookSettingsManagerImpl :: Found the following webhooks in the cache that no-longer map to webhooks in project 'VcsProjectTest'. WebHooks with ids '[id_377052579, id_283758663]' will be removed from the cache.

This does have the downside that IDs will churn. This causes two annoyances.

  1. The history page will link webhook events with webhooks that might no longer be relevant.
  2. If the edit webhooks page is loaded, and then the webhooks change, the edit will fail because the webhook cannot be found.

Ideally, webhooks should have unique IDs. This will work around this issue.

TeamCity's buildTypeUnregistered event is poorly named In 2.0.1, I added a listener for the buildTypeUnregistered event. This allows me to cleanup webhooks that might be interested in a build that has been deleted.

However, I have found that this event is fired when the VCS configuration is reloaded. TeamCity appears to re-read the VCS settings, delete all builds on that project, and then recreate them with the same internal and external IDs. At which point the buildTypeUnregistered fires for each build.

When I see that event, I diligently clean up any webhooks that explicitly reference that build, and remove that build (btxxxx) id from their configuration.

This presents as webhooks having 0 builds registered. Sometimes webhooks still have builds showing. This is because a project edit (either via VCS or from the UI) fires up to 10 events in quick succession. I have a queue backed delay and de-bounce algorithm to only present one of each type of event.

I therefore only see one buildTypeUnregistered and one projectRestored, but they will fire at almost the same time. At which point a race condition exists as to whether the cache is refreshed while the build is marked as deleted or has been "undeleted".

I have removed the buildTypeUnregistered event listener, and builds now reliably show for all webhooks. I will do some further testing to make sure that not listening for this event has no problematic side-effects.

I'll do more testing on various configurations. I hope to have a very early beta release of 2.0.2 available for testing in 24 hours.

Regarding the "ghost webhooks" part you mentioned here, I may be receiving a similar issue. I was looking through open issues to see if my bug is already listed in here, and this sounds similar. When accessing the webhooks/search.html?show=all, I get the message "There are no matching WebHooks found." Yet, when I access admin.html?item=tcWebHooks, then I see "There are 4 WebHooks configured in this TeamCity Server." The webhooks stopped running at some point, and I can no longer see them on the search page. If I go to the project where the webhook is configured, I can see it and select it, but then when trying to edit it, I get an error message in the window saying "A WebHook with that ID was not found in this project." I don't mean to take away focus from the above issues if this is a separate problem, but it seemed similar enough to me to post this here. I'm using version 2.0.1 and the REST API 2.0.1-legacy. TeamCity version is 2023.11.1

netwolfuk commented 1 week ago

I have a dual node setup running, and am seeing some more of the issues you reported @radiumsurfer

I have noticed that if a node comes up and and it doesn't think it is main, will not load the tcWebHooks plugin. This is when you get all the messages saying Unknown tag in plugin-settings.xml for project with internal id 'project100669' (will ignore): webhooks

After that point, the webhooks are only discovered either when the server reboots and determines that it is already the main server or when each project is edited and reloads its own webhooks.

The reason for it not loading the webhooks when the plugin loads is because of this issue.

When the tcWebHooks plugin loads on an already running server (eg, new tcPlugins install, updated tcWebHooks plugin is installed and hot-relaoded, secondary node becomes main), the plugin-settings is loaded but contains no entries.

I think what probably happens is that the plugin-settings.xml is read, but there is no registered handler to read the <webhooks> blocks. TeamCity must cache this processed object, and not reload it again when starting new plugins.

I have started working on converting over to SProjectFeatureDescriptor based configuration. Once this is done I can get settings.kts support working and can allow you all to maintain your webhooks in there.

netwolfuk commented 1 week ago

Hi @DrSharky Thanks for your post. Can I please get a bit more detail about your TeamCity configuration?