kolide / fleet

A flexible control server for osquery fleets
https://kolide.com/fleet
MIT License

External Data Sinks for OSQuery Results and Status messages #1875

Closed jrossi closed 5 years ago

jrossi commented 6 years ago

I am about to spend too much time on planes, and as a PoC I am planning on adding osquery results logging directly into Fleet. This would allow the results to be processed downstream and outside of Fleet.

Normally I would just reach for Kafka and move on, but after looking at the code base I think NATS (https://github.com/nats-io/gnatsd) might be a slightly better option. Here is why:

With that being said, ignoring Kafka is crazy, but for a PoC I think simple is better.

groob commented 6 years ago

I think adding pubsub somewhere would be a good idea for us to support a more pluggable pipeline. It might require some difficult refactoring before we can integrate it, but it's likely worth doing.

I'd love to see an abstraction like this somewhere https://github.com/NYTimes/gizmo/tree/master/pubsub
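
As a rough illustration (hypothetical Go, not gizmo's actual interface), the kind of publisher abstraction meant here might look like:

package pubsub

import "context"

// Publisher is a minimal, backend-agnostic interface that Fleet could write
// osquery result and status logs to. Kafka, NATS, a local file, etc. would
// each provide their own implementation.
type Publisher interface {
    // Publish sends a single raw log line to the underlying sink.
    Publish(ctx context.Context, key string, data []byte) error
    // Close flushes any buffered messages and releases resources.
    Close() error
}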

@marpaia, @zwass and I have all discussed various strategies for how to integrate something like that in fleet. Idk if we have something written down at the moment but would love to discuss an approach.

(any POC would be great too).

jrossi commented 6 years ago

#1899 was created as a proof of concept of this. I opened the PR well before it's done to gather feedback early :)

jrossi commented 6 years ago

I have a much better understanding of how Fleet works and how the code is put together after doing my PoC in #1899. With that, I would like to propose the following as a method for integrating external data sinks for osquery results/status.

Proposal

Rather than adding more server configuration that needs to be completed on hosts running Fleet, it would be simpler to define the routing of data to sinks using an apiVersioned YAML specification.

This would keep the server configuration very simple while also allowing a simple upgrade path for users. Here is an example of what the config would look like.

apiVersion: v1
kind: output
spec:
  name: windows_kafka_queue
  descriptions: All windows hosts request pro
  type: kafka
  kafka:
    topic: fleet.windows.results
    brokers: 127.0.0.1:9018
  match:
    - type: results
    - platform: windows
---
apiVersion: v1
kind: output
spec:
  name: unix_nats_queue
  descriptions: All
  type: kafka
  kafka:
    topic: fleet.unix.{.PlatForm}
    brokers: 127.0.0.1:9018
  match:
    - type: results
    - platform: !windows
---
apiVersion: v1
kind: output
spec:
  name: local_results
  descriptions: All
  type: file
  file:
    path: /tmp/fleet_results
    log_rotate: false
  match:
    - type: results
---
apiVersion: v1
kind: output
spec:
  name: local_status
  descriptions: All
  type: file
  file:
    path: /tmp/fleet_status
    log_rotate: false
  match:
    - type: status
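
To sketch how Fleet might consume such a spec (all type and field names below are hypothetical, not existing Fleet code), each YAML document could be decoded into something like:

package main

import (
    "fmt"
    "log"

    yaml "gopkg.in/yaml.v2"
)

// OutputSpec mirrors the proposed apiVersion/kind/spec layout above.
type OutputSpec struct {
    APIVersion string `yaml:"apiVersion"`
    Kind       string `yaml:"kind"`
    Spec       struct {
        Name        string              `yaml:"name"`
        Description string              `yaml:"descriptions"`
        Type        string              `yaml:"type"`
        Kafka       *KafkaConfig        `yaml:"kafka"`
        File        *FileConfig         `yaml:"file"`
        Match       []map[string]string `yaml:"match"`
    } `yaml:"spec"`
}

type KafkaConfig struct {
    Topic   string `yaml:"topic"`
    Brokers string `yaml:"brokers"`
}

type FileConfig struct {
    Path      string `yaml:"path"`
    LogRotate bool   `yaml:"log_rotate"`
}

func main() {
    doc := []byte(`
apiVersion: v1
kind: output
spec:
  name: local_results
  type: file
  file:
    path: /tmp/fleet_results
    log_rotate: false
  match:
    - type: results
`)
    var spec OutputSpec
    if err := yaml.Unmarshal(doc, &spec); err != nil {
        log.Fatal(err)
    }
    // A router would then walk the match rules to decide which sink
    // receives a given result or status log.
    fmt.Println(spec.Spec.Name, spec.Spec.Type, spec.Spec.File.Path)
}
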
marpaia commented 6 years ago

I think we should go back and forth on the file format here, but I love the idea of defining logging pipeline details with the file format! Great idea, @jrossi.

zwass commented 6 years ago

It's not clear to me that Fleet should be in the business of sophisticated routing to different output streams. @jrossi can you give some examples of other projects that do sophisticated routing like this (vs. simple output and handling complex routing in the logging pipeline)?

jrossi commented 6 years ago

@zwass sure. The one that comes to mind is different parts of the same org consuming things in different ways. Security teams want the results and status messages to know the state of the world, desktop support needs the data in Splunk, and desktop engineering wants to push the data into SQL Server to build a global list of all listening ports. I know of 5-6 different engineering teams in my company that would consume just a subset of the data.

While I understand the question, it seems best to do the routing where all the state information is stored. Otherwise we have to expand the results with the host metadata from Fleet's database and send them on; with the host metadata added, effective filtering can happen outside of Fleet. While doable, that does not seem like the best choice, but I could be missing something or a better way.
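
For example, an enriched result sent on from Fleet might look roughly like this (purely hypothetical, just to illustrate that alternative):

package main

import (
    "encoding/json"
    "fmt"
)

// enrichedResult is a hypothetical envelope: the raw osquery result line plus
// the host metadata Fleet already stores, so consumers outside of Fleet could
// filter without querying Fleet's database.
type enrichedResult struct {
    Hostname string          `json:"hostname"`
    Platform string          `json:"platform"`
    Labels   []string        `json:"labels"`
    Result   json.RawMessage `json:"result"`
}

func main() {
    r := enrichedResult{
        Hostname: "workstation-042",
        Platform: "windows",
        Labels:   []string{"desktop-support"},
        Result:   json.RawMessage(`{"name":"listening_ports","columns":{"port":"443"}}`),
    }
    out, _ := json.Marshal(r)
    fmt.Println(string(out))
}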

Here is my company's use case: currently I am also working on building a pull request model for our internal teams to define scheduled queries and have them roll out to groups of hosts using fleetctl and labels. With routing defined as I proposed, I would also be able to let them consume the data they want in the manner they want.

In the future, if all this makes sense, the next step is how to effectively route results by the name of scheduled queries so results can be filtered even further. Think something like the following, but I have not gotten that far into this yet.

match:
  query-name-startswith: desktop-support

Or

match:
  query-name:
    - secops.*
    - ir.*
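
A rough sketch of how that matching could work (hypothetical, nothing is implemented yet; patterns are treated as regular expressions):

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// matchesQueryName reports whether a scheduled query name satisfies either a
// prefix rule or any of the supplied regular-expression patterns.
func matchesQueryName(name, prefix string, patterns []string) bool {
    if prefix != "" && strings.HasPrefix(name, prefix) {
        return true
    }
    for _, p := range patterns {
        if ok, err := regexp.MatchString(p, name); err == nil && ok {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(matchesQueryName("secops.listening_ports", "", []string{"secops.*", "ir.*"})) // true
    fmt.Println(matchesQueryName("desktop-support.chrome", "desktop-support", nil))           // true
}
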
jrossi commented 6 years ago

Here is what I think would be a sane way forward:

Given this is a rather large number of changes that depend on each other, how would you like me to move forward?

therealmik commented 5 years ago

Hi, is any work ongoing with this?

If not, I'd like to implement a very simple Google PubSub publisher (no routing/filtering logic).

zwass commented 5 years ago

Hi @therealmik. I just pushed support for logging to AWS Firehose in #2022. There is no work ongoing to enable routing/matching for output plugins. You could totally follow the patterns introduced in that PR to add support for GCP Pub/Sub.
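
As a rough, hypothetical sketch of what such a plugin could look like (made-up names, not the actual implementation):

package pubsubwriter

import (
    "context"

    "cloud.google.com/go/pubsub"
)

// PubSubWriter is a hypothetical plugin in the spirit of the Firehose one in
// #2022: each osquery result/status log line is published to a Pub/Sub topic.
type PubSubWriter struct {
    topic *pubsub.Topic
}

func New(ctx context.Context, project, topic string) (*PubSubWriter, error) {
    client, err := pubsub.NewClient(ctx, project)
    if err != nil {
        return nil, err
    }
    return &PubSubWriter{topic: client.Topic(topic)}, nil
}

// Write publishes one log line and waits for the server to acknowledge it.
func (w *PubSubWriter) Write(ctx context.Context, logLine []byte) error {
    res := w.topic.Publish(ctx, &pubsub.Message{Data: logLine})
    _, err := res.Get(ctx)
    return err
}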

therealmik commented 5 years ago

That looks great - it should be pretty easy to write with all the hard work done already :)

therealmik commented 5 years ago

FYI, the code is written, I'm just doing some testing before creating the PR

therealmik commented 5 years ago

Hi @zwass - the filesystem module has changed from previous releases - it no longer writes newlines between messages in the results log.

On this line: https://github.com/kolide/fleet/pull/2022/files#diff-b016735b4714c1a8b7942556a87e4b68R205

The append must be allocating a new buffer almost every time, so the appended newline is lost. I'm not really sure if the newline is needed for Firehose (it isn't for Pub/Sub) - if not, maybe move the newline handling into the filesystem logger anyway?
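
For anyone following along, the underlying Go behavior is that append returns a new slice header; the caller's slice keeps its old length, so the appended byte never shows up there. A standalone illustration (not Fleet code):

package main

import "fmt"

// addNewline appends a newline but discards append's return value, so the
// caller's slice is unchanged: its length stays the same, and if append had
// to grow the slice, the write went to a freshly allocated backing array.
func addNewline(b []byte) {
    _ = append(b, '\n')
}

func main() {
    msg := []byte("result")
    addNewline(msg)
    fmt.Printf("%q\n", msg) // "result" - the newline was lost

    // The fix is to use the slice returned by append.
    msg = append(msg, '\n')
    fmt.Printf("%q\n", msg) // "result\n"
}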

zwass commented 5 years ago

@therealmik Great find there. Another user reported the symptoms of that problem as well. That was a mistake on my part. I just put up https://github.com/kolide/fleet/pull/2029 to fix this. The newline is needed in both filesystem and Firehose. These new changes ought to work appropriately with pubsub.

therealmik commented 5 years ago

Awesome, that patch looks good to me.

zwass commented 5 years ago

I am going to close this as we now have both AWS Firehose and GCP Pubsub as logging output plugins, and the pattern is established for adding new ones (see https://github.com/kolide/fleet/pull/2022 and https://github.com/kolide/fleet/pull/2049).