grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail's windows_events scraper should produce parseable, structured data #10608

Open RamblingCookieMonster opened 1 year ago

RamblingCookieMonster commented 1 year ago

Is your feature request related to a problem? Please describe.

The current implementation of the windows_events scraper for promtail does not fully parse Windows events into parseable structured data. There are many negative outcomes across various stakeholders; for example:

You can likely imagine other reasons. Parseable, structured data is critical in the world of logs and for the systems that consume that data.

Here's an example of what you produce today via Promtail's windows_events:

{
  "source": "Microsoft-Windows-TaskScheduler",
  "channel": "Microsoft-Windows-TaskScheduler/Operational",
  "computer": "REDACTED",
  "event_id": 201,
  "version": 2,
  "level": 4,
  "task": 201,
  "opCode": 2,
  "levelText": "Information",
  "taskText": "Action completed",
  "opCodeText": "Stop",
  "keywords": "0x8000000000000000",
  "timeCreated": "2023-09-14T17:10:00.579307200Z",
  "eventRecordID": 2295677,
  "correlation": {
    "activityID": "{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}"
  },
  "execution": {
    "processId": 1816,
    "threadId": 2892,
    "processName": "svchost.exe"
  },
  "security": {
    "userId": "S-1-5-18",
    "userName": "NT AUTHORITY\\SYSTEM"
  },
  "event_data": "<Data Name='TaskName'>\\REDACTED</Data><Data Name='TaskInstanceId'>{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}</Data><Data Name='ActionName'>C:\\Windows\\SYSTEM32\\cmd.exe</Data><Data Name='ResultCode'>0</Data><Data Name='EnginePID'>4628</Data>",
  "message": "Task Scheduler successfully completed task \"\\REDACTED\" , instance \"{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}\" , action \"C:\\Windows\\SYSTEM32\\cmd.exe\" with return code 0."
}

Notice the event_data. It is not parsed into named fields (in this case TaskName, TaskInstanceId, etc., or preferably with a prefix like Data_TaskName to avoid collisions, as Telegraf does); it's an XML-ish string bunched into a single field.

This results in folks relying on one-off (not scalable or generalizable) "solutions" that use that XML-y field, or relying on the Message field (which, again, should not be relied on) with rather absurd queries like this one, from the previously referenced sigma post:

{job=~"eventlog|winlog|windows|fluentbit.*"} 
| json | label_format Message=`{{ .message | replace "\\" "\\\\" | replace "\"" "\\\"" }}` 
| line_format `{{ regexReplaceAll "([^:]+): ?((?:[^\\r]*|$))(\r\n|$)" .Message "${1}=\"${2}\" "}}` 
| logfmt | event_id=1 and ...

Describe the solution you'd like

Parse EventData and UserData please. You likely should do this on the Windows/promtail side of the house. I cannot help you here, but I can at least point out that Telegraf, Winlogbeat, Splunk, and presumably other agents can do this (IMHO) bare-minimum windows event parsing.

For example, "event_data": "<Data Name='TaskName'>\\REDACTED</Data><Data Name='TaskInstanceId'>{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}</Data><Data Name='ActionName'>C:\\Windows\\SYSTEM32\\cmd.exe</Data><Data Name='ResultCode'>0</Data><Data Name='EnginePID'>4628</Data>" might expand to:

{
  "TaskName": "\\REDACTED",
  "TaskInstanceId": "{6191c9fe-4655-4af1-bfbe-8d48d51ee41e}",
  "ActionName": "C:\\Windows\\SYSTEM32\\cmd.exe",
  "ResultCode": 0,
  "EnginePID": 4628
}

Consideration would need to be given to escaping " and \ within values; I've written the above by hand, so it's not going to be perfect. You might also prefix the keys (e.g. Data_TaskName, Data_ResultCode) and move them to the root level, or make this an option, particularly if you want to help community members who might be relying on Telegraf, which uses that convention (Data_ prefix).

This should cover UserData as well, on the subset of events with this.
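To make the request concrete, here is a minimal Python sketch of the expansion described above. It is not Promtail code (Promtail is written in Go); the function name parse_event_data, the param1-style fallback for unnamed elements, and the assumption that the fragment is well-formed XML once wrapped in a root element are all mine. Values are left as strings, as winlogbeat does, rather than guessing at types.

```python
import xml.etree.ElementTree as ET


def parse_event_data(fragment: str, prefix: str = "Data_") -> dict:
    """Flatten a run of <Data Name='...'>value</Data> elements into a dict.

    Hypothetical helper for illustration; assumes the fragment is
    well-formed XML once it is wrapped in a single root element.
    """
    # The fragment has no single root element, so wrap it before parsing.
    root = ET.fromstring(f"<EventData>{fragment}</EventData>")
    fields = {}
    for index, node in enumerate(root.iter("Data")):
        # Fall back to a positional key if a <Data> element has no Name
        # (the param1, param2 naming is an assumption of this sketch).
        name = node.get("Name") or f"param{index + 1}"
        # Keep every value as a string; no type inference is attempted.
        fields[prefix + name] = node.text or ""
    return fields


# Shortened version of the event_data string from the example above.
fragment = (
    "<Data Name='TaskName'>\\REDACTED</Data>"
    "<Data Name='ResultCode'>0</Data>"
    "<Data Name='EnginePID'>4628</Data>"
)
print(parse_event_data(fragment))
```

Passing prefix="" would give the unprefixed keys shown in the hand-written JSON above; the default follows the Telegraf-style Data_ convention.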

Describe alternatives you've considered

Additional context

Not much. A few references:

If this is just me holding it wrong, please let me know, but after a few days of reading and testing, I'm pretty confident this is indeed not in place. I include it as a "feature", but to me, for a logging solution, this is more a "bug". Thanks!

RamblingCookieMonster commented 1 year ago

Oh dear, I see the example I linked actually turned into a parse-the-message-field implementation. While... that is something, and took time and effort, I want to emphasize that for Windows, that is absolutely not the approach to take, though I totally understand that perhaps not everyone using promtail/Loki has Windows experience.

There are other references, but take this Microsoft-provided spreadsheet that focuses solely on the Security log event IDs, and presumably only a subset, as things have changed. Note the Complete Event Messages sheet. It illustrates how Windows Events work (there are far better / deeper references, but this is a simple way to illustrate it). For example:

Event ID 4713:

Kerberos policy was changed.

Subject:
 Security ID:  %1
 Account Name:  %2
 Account Domain:  %3
 Logon ID:  %4

Changes Made:
('--' means no changes, otherwise each change is shown as:
(Parameter Name): (new value) (old value))
%5

%5 would not be captured in this case.

Event ID 4899:

A Certificate Services template was updated.

%1 v%2 (Schema V%3)
%4
%5

Template Change Information:
 Old Template Content: %8
 New Template Content:  %7

Additional Information:
 Domain Controller: %6

More data that would not be parsed.

Event ID 4624:

An account was successfully logged on.

              Subject:
                  Security ID:        %1
                  Account Name:        %2
                  Account Domain:        %3
                  Logon ID:        %4

              Logon Type:            %9

              New Logon:
                  Security ID:        %5
                  Account Name:        %6
                  Account Domain:        %7
                  Logon ID:        %8
                  Logon GUID:        %13

              Process Information:
                  Process ID:        %17
                  Process Name:        %18

              Network Information:
                  Workstation Name:    %12
                  Source Network Address:    %19
                  Source Port:        %20

              Detailed Authentication Information:
                  Logon Process:        %10
                  Authentication Package:    %11
                  Transited Services:    %14
                  Package Name (NTLM only):    %15
                  Key Length:        %16

              This event is generated when a logon session is created. It is generated on the computer that was
              accessed.

              The subject fields indicate the account on the local system which requested the logon. This is most
              commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.

              The logon type field indicates the kind of logon that occurred. The most common types are 2
              (interactive) and 3 (network).

              The New Logon fields indicate the account for whom the new logon was created, i.e. the account that was
              logged on.

              The network fields indicate where a remote logon request originated. Workstation name is not always
              available and may be left blank in some cases.

              The impersonation level field indicates the extent to which a process in the logon session can
              impersonate.

              The authentication information fields provide detailed information about this specific logon request.
                  - Logon GUID is a unique identifier that can be used to correlate this event with a KDC event.
                  - Transited services indicate which intermediate services have participated in this logon request.
                  - Package name indicates which sub-protocol was used among the NTLM protocols.
                  - Key length indicates the length of the generated session key. This will be 0 if no session key was
              requested.

So... maybe the parsing accounted for this, but how would it parse Security ID and differentiate the Subject from the New Logon? Also, do you see how long that field is with all that text? In addition to the real event_data, this massive string is sent for an event ID that is quite, quite common in busy environments.
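A small Python sketch of the Security ID problem, using an abbreviated version of the Event ID 4624 template quoted above. The insertion values are made-up placeholders, and render and naive_parse are hypothetical helpers that only mimic the general "substitute %N, then scrape Label: value pairs" approach; they are not taken from any agent.

```python
import re

# Abbreviated from the Event ID 4624 message template quoted above.
TEMPLATE = (
    "Subject:\n"
    "    Security ID:  %1\n"
    "    Account Name:  %2\n"
    "New Logon:\n"
    "    Security ID:  %5\n"
    "    Account Name:  %6\n"
)

# Placeholder insertion strings (made up for illustration), in %1..%6 order.
INSERTS = ["S-1-5-18", "HOST$", "DOMAIN", "0x3e7", "S-1-5-21-REDACTED", "alice"]


def render(template, inserts):
    # Replace each %N placeholder with the N-th insertion string (1-based).
    return re.sub(r"%(\d+)", lambda m: inserts[int(m.group(1)) - 1], template)


def naive_parse(message):
    # Scrape "Label: value" lines, keyed only by the label text.
    pairs = re.findall(r"^[ \t]*([^:\n]+):[ \t]+(\S.*)$", message, flags=re.M)
    return dict(pairs)


parsed = naive_parse(render(TEMPLATE, INSERTS))
# The New Logon values overwrite the Subject values under the same labels.
print(parsed)
```

Because both sections use the labels Security ID and Account Name, the later pair replaces the earlier one and the Subject values are lost. The named fields Windows already provides (SubjectUserSid and TargetUserSid in the winlogbeat output) do not have this collision.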

That data should be in a much more compact set of fields that Windows provides, but which is not currently parsed. Here's an example from winlogbeat, which, among other agents, parses this data without relying on the Message field:

    "event_data": {
      "ProcessName": "C:\\Windows\\System32\\lsass.exe",
      "LogonGuid": "{00000000-0000-0000-0000-000000000000}",
      "TargetOutboundDomainName": "-",
      "VirtualAccount": "%%1843",
      "IpPort": "52024",
      "TransmittedServices": "-",
      "LmPackageName": "-",
      "RestrictedAdminMode": "-",
      "ElevatedToken": "%%1842",
      "WorkstationName": "REDACTED",
      "SubjectDomainName": "REDACTED",
      "TargetDomainName": "REDACTED",
      "LogonProcessName": "Advapi  ",
      "LogonType": "3",
      "SubjectLogonId": "0x3e7",
      "KeyLength": "0",
      "TargetOutboundUserName": "-",
      "TargetLogonId": "0x1a2497c9f",
      "TargetLinkedLogonId": "0x0",
      "SubjectUserName": "REDACTED$",
      "IpAddress": "REDACTED",
      "ImpersonationLevel": "%%1833",
      "ProcessId": "0x530",
      "TargetUserName": "REDACTED",
      "SubjectUserSid": "S-1-5-18",
      "TargetUserSid": "S-1-5-21-REDACTED",
      "AuthenticationPackageName": "MICROSOFT_AUTHENTICATION_PACKAGE_V1_0"
    },

Cheers!

cstyan commented 10 months ago

Hello, thanks for reporting this.

We're currently reevaluating Promtail's position as a project within Grafana Labs. Internally we're actually using the Agent for both metrics and logs collection at this point. Additionally, the Agent team is more likely to have time to dedicate to this. It's likely a fix would only go into the Agent, but if there's an argument for adding a change here in Promtail as well, that can be discussed.

At the very least, the Agent team is actually going to have people who would have context about Windows in general.

mennotech commented 6 months ago

@RamblingCookieMonster Would you consider opening this issue with the Grafana Agent team? I am running into the same issue using the Grafana Agent. You've spent the time creating a well crafted issue / feature request, and it would be great if the appropriate team was notified. I could try creating the request there, but it wouldn't be as thorough a post as you have here.

It seems that winlogbeat parses the event_data into separate fields (see https://www.elastic.co/guide/en/beats/winlogbeat/current/exported-fields-winlog.html#_event_data).

My workaround may be to have winlogbeat write the Windows security events to a text file and then have Grafana Agent read this file and push it to Loki. This should work, but greatly complicates the setup.

RamblingCookieMonster commented 6 months ago

@mennotech - feel free to borrow from this and/or copy it over! Yeah, we ended up avoiding the write back thing for this and a few other spots it would have been handy (it also puts a bit more pressure on IO/storage, but it does work, good find!). Ultimately, we're going to likely end up using Splunk for this sort of data, so while this is something I would encourage Grafana Labs to implement, it's not something I'll have time to push for. Cheers!