matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

Github Audit - Define `token_id` field statically as long or string #179

Closed · damon-edstrom closed this issue 1 year ago

damon-edstrom commented 1 year ago

Hello,

I recently adopted a Matano deployment and am having an issue with GitHub audit logs: the automatic schema discovery doesn't identify the type of the token_id field correctly, so many logs with that field populated fail the parquet transform. Based on the ECS definition, I'm guessing the whole log is converted to parquet before the unused fields are dropped. Since token_id isn't defined, discovery maps it to int, which is fine for new tokens, but the older "classic" token IDs have more digits and need to be typed as long or string for both the new and classic formats to parse properly. Otherwise, actions made with a classic token ID, which is typically the majority of them, land in the error bucket.
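To make the mismatch concrete, here's a minimal standalone sketch (not Matano code); the token_id values are made up, and pyarrow just stands in for the parquet writer:

```python
# Illustrative only: "classic" PAT ids are simply larger numbers than a
# 32-bit int column can hold, while long (int64) or string handles both.
import pyarrow as pa

new_id = 1_234_567               # fits comfortably in a 32-bit int
classic_id = 598_765_432_100     # made-up larger value, classic-style

# long (int64) or string accepts both styles of id
pa.array([new_id, classic_id], type=pa.int64())
pa.array([str(new_id), str(classic_id)], type=pa.string())

# a column typed as a 32-bit int rejects the classic-style id
try:
    pa.array([new_id, classic_id], type=pa.int32())
except Exception as exc:  # pyarrow reports an out-of-range error here
    print(f"int32 column rejected the classic-style id: {exc}")
```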

I've verified that this is a parquet-specific issue: while testing, I ran a Spark job with parquet output and it failed on the same logs Matano does, whereas running the same job with JSON output succeeded. While researching what other products do, I only found Panther SIEM calling the field out specifically; they cast it to string so they don't have to worry about the format changing in the future. If you let me know your preference, I can open a PR with an updated schema definition.
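A sketch of that kind of check, with placeholder paths and buckets (not the actual job I ran), would look like:

```python
# Read the audit events as line-delimited JSON, pin token_id to string
# (the Panther-style choice), and write parquet. Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gh-audit-token-id-check").getOrCreate()

df = spark.read.json("s3://example-bucket/github_audit/*.jsonl")
df = df.withColumn("token_id", F.col("token_id").cast("string"))

df.write.mode("overwrite").parquet("s3://example-bucket/github_audit_parquet/")
```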

I can provide sanitized example logs, but the value is just a number too big for an int, so I'm not sure how much help they'd be. Please let me know if I can provide any more details or answer any questions.

Samrose-Ahmed commented 1 year ago

To clarify, there's no schema discovery in Matano; there is a specific, defined schema. Can you explain where exactly you are seeing the error?

Our current schema for GH audit logs does not mention a normalized token_id field (see https://github.com/matanolabs/matano/blob/main/data/managed/log_sources/github_audit/log_source.yml), so I don't see why there would be an error. Since we are not normalizing it, it is retained in the event.original field. I can see token_id in the original JSON there in my test logs.
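For example, pulling token_id back out of the retained raw event is just JSON parsing; a quick sketch (the record below is invented for illustration):

```python
# Assuming event.original holds the original JSON document as a string,
# token_id can be read back out of it directly.
import json

record = {
    "event": {
        "original": '{"action": "git.clone", "token_id": 598765432100, "actor": "octocat"}'
    }
}

raw = json.loads(record["event"]["original"])
print(raw.get("token_id"))  # 598765432100
```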

If you are proposing adding a normalized token_id field, such as github.token_id, I am open to that since it seems common.

damon-edstrom commented 1 year ago

I think this might have been coincidental: all of the schemaMismatch sidelined events happened to contain classic PAT fields, which is what caused Spark to trip up when outputting to parquet. Going back to dig some more, I noticed the failures stopped a week ago, and we didn't make any changes in that window, so I'm not sure why they stopped. Since token_id is in the raw event, folks can parse it from there if needed; the only use case I see is that if a user account is compromised because a token got out, it would help to isolate the token making the malicious calls.

Closing the issue; I'll reference this one if it starts up again for some reason.