open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.06k stars 2.36k forks source link

Windows Event Log Receiver doesn't collect some legacy events correctly #35135

Closed dharley-atlastix closed 1 month ago

dharley-atlastix commented 1 month ago

Component(s)

receiver/windowseventlog

What happened?

Description

When the Windows Event Log Receiver encounters an older style event, it is unable to correctly ingest the event details as the vendor xml template is seemingly missing. This "appears" to be because the source of the event uses some other legacy templating system (DLL) to create the event, and the receiver is setup to expect that a Vendor XML Template will always exist. Assuming the XML lookup is used because it is more efficient than directly rendering, it maybe best to implement a fallback strategy to ensure all data is captured using an alternative rendering API.

Steps to Reproduce

Setup a windows event log receiver and capture the events generated by the following built in Windows commands: diskpart chkdsk /c This has also been observed from events generated by the DHCPServer service and VSS, however recreating these conditions is more complicated.

Expected Result

No error should be logged about a missing template as the event is structured correctly in the Windows Event Log viewer. All fields should be populated the same as the Event Log viewer and per other events extracted via the receiver.

Actual Result

The following error is logged:

2024-09-11T12:01:22.070+1000    error   windows/input.go:283    Failed to render formatted event        {"kind": "receiver", "name": "windowseventlog/system", "data_type": "logs", "operator_id": "windows_eventlog_input", "operator_type": "windows_eventlog_input", "error": "syscall to 'EvtFormatMessage' failed: The message ID for the desired message could not be found."}
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).processEvent
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.109.0/operator/input/windows/input.go:283
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).read
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.109.0/operator/input/windows/input.go:223
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readToEnd
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.109.0/operator/input/windows/input.go:198
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/operator/input/windows.(*Input).readOnInterval
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza@v0.109.0/operator/input/windows/input.go:186

Fields such as the Message, Keywords, Opcode contain missing or unexpected data.

{
    "channel": "System",
    "computer": "SVRSYDDHCP01",
    "event_data": {
        "data": [
            {
                "": "@2010001"
            }
        ]
    },
    "event_id": {
        "id": 4,
        "qualifiers": 16896
    },
    "execution": {
        "process_id": 0,
        "thread_id": 0
    },
    "keywords": [
        "0x80000000000000"
    ],
    "level": "4",
    "message": "",
    "opcode": "0",
    "provider": {
        "event_source": "",
        "guid": "",
        "name": "Virtual Disk Service"
    },
    "record_id": 29962,
    "system_time": "2024-09-11T02:01:21.5315561Z",
    "task": "0"
}

image

The Provider used to contain incomplete data in 0.107 and below, but the recent change to rendering the name using system properties appears to have addressed this. https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34914

Collector version

v0.109.0

Environment information

Environment

OS: ("Windows 2022") Compiler(n/a)

OpenTelemetry Collector configuration

receivers:
  windowseventlog/application:
    channel: Application
    poll_interval: 1s
    exclude_providers: [VSS]
  windowseventlog/system:
    channel: System
    poll_interval: 1s
  windowseventlog/security:
    channel: Security
    poll_interval: 1s
    exclude_providers: [Service Control Manager]

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [windowseventlog/application, windowseventlog/system, windowseventlog/security]
      exporters: [debug]

Log output

No response

Additional context

No response

github-actions[bot] commented 1 month ago

Pinging code owners:

pjanotti commented 1 month ago

/label os:windows

pjanotti commented 1 month ago

@dharley-atlastix as you said the receiver should actually fallback to the non-formatted event rendering when the event source is empty - at least the JSON that you attached and the event that I tested had an empty event source. If there is no source there isn't much to be "formatted" anyway. That said I couldn't repro the issue, as if the EvtFormatMessage API automatically does the fallback in my case. I ran chkdsk /c on Windows 11 and both the Event Viewer and the collector were able to render the event properly and no error logged by the collector. It seems that the box used in the report behaves different than the one that I used. Could you check if the collector logged any error opening an event source provider? The error should contain the failed to open the metadata for the string.

Anyway, this behavior can be changed on the collector to avoid the issue.

For reference the chkdsk /c event captured and processed on a Windows 11 box:

image

JSON captured by the collector:

{
    "channel": "Application",
    "computer": "<computer_name>",
    "event_data": {
        "binary": "00BB570053625100016C8D0000000000F20300004CCC0B000000000000000000",
        "data": [
            {
                "": "<chkdsk /c output omitted for brevity>"
            }
        ]
    },
    "opcode": "Info",
    "provider": {
        "event_source": "",
        "guid": "",
        "name": "Chkdsk"
    },
    "record_id": 3917664,
    "system_time": "2024-09-16T20:45:49.4408698Z",
    "task": "0"
}
dharley-atlastix commented 1 month ago

@pjanotti Your interesting results made me go back and take a look in to the issue a bit further. Testing on Windows 11 as per your method yielded no issues, likewise testing an RTM Server 2016 was the same. The test machines that are exhibiting and issue are all Server 2022 RTM (minimal test images with no updates). Updating the machines the the latest CU has resolved the issue. So this appears to be some kind of Windows bug or environmental issue, which if I'm honest, I don't intend to spend anymore time looking in to unless it pops it head up somewhere else.

Interestingly I tried test rendering the message field using a EvtFormatMessageEvent parameter against the a modifed evtFormatMessage function, and it returned the message data as per the eventlog. This suggest its possible to build a method to fallback when there when ErrorEvtMessageIdNotFound is returned (such as no vendor template), but for now that seems more like a feature request outside the scope of this ticket.

pjanotti commented 1 month ago

Thanks for looking back at all these combinations @dharley-atlastix - this is useful information! I appreciate that.

Given that it happened in your deployment and we can workaround the issue in a transparent fashion, I'm going ahead with PR #35227. When that is released you shouldn't see the issue even on the 2022 RTM with no updates.