osquery backend development discussion

marpaia commented 9 years ago

Over the past few months, the osquery engineering team at Facebook has been hard at work on a lot of features and product improvements. As osquery development continues and we add more and more functionality, it's clear that many of the features that we're working on will require some sort of backend infrastructure. For example, consider #201 (ad-hoc distributed queries) and #722 (event-based filesystem integrity monitoring). Both of these features require a full, comprehensive understanding of the data collected by osquery across your entire infrastructure.

We've spent a lot of time reasoning about how we're going to solve this problem sustainably. Above all, we want to be “open”. We started the osquery project as an open source project because we wanted to be able to help improve the security posture of Facebook, as well as other companies, all over the world.

When we create software, we want the experience of using it to be amazing. That's why osquery's operating system analytics features are based on SQL, an easy, approachable query language. If you check out http://osquery.io/downloads/, it's clear that we care strongly about giving people a turn-key solution that they can use in their environment.
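To make the SQL point concrete, here is a minimal sketch of the kind of question a SQL interface lets you ask. It does not talk to osquery at all: it uses Python's built-in sqlite3 module with an in-memory table that loosely imitates a process listing. The table layout and the sample rows are assumptions made up for illustration.

```python
import sqlite3


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory table with made-up rows shaped like a process list."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processes (pid INTEGER, name TEXT, path TEXT)")
    conn.executemany(
        "INSERT INTO processes VALUES (?, ?, ?)",
        [
            (1, "launchd", "/sbin/launchd"),
            (412, "sshd", "/usr/sbin/sshd"),
            (977, "updater", ""),  # sample row with no binary path recorded
        ],
    )
    return conn


def processes_without_binary(conn: sqlite3.Connection) -> list:
    """Return (pid, name) for rows whose path column is empty."""
    cursor = conn.execute(
        "SELECT pid, name FROM processes WHERE path = '' ORDER BY pid"
    )
    return cursor.fetchall()


conn = build_sample_db()
print(processes_without_binary(conn))
```

The appeal is that the question ("which processes have no binary path?") is a one-line SELECT rather than custom parsing code.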

When we think about what osquery's backend will look like, we want to deliver a turn-key solution for everyone's environment. Unfortunately, all of the components that are required for a backend product of this scale can't be packaged up into an RPM and installed on a server, turn-key style.

We're exploring building a backend for osquery on top of Facebook production infrastructure. We have a few tricks up our sleeves, and we still want you to be able to use it and benefit from it, regardless of where you work.

Why do we feel that this is a good idea?

Quality

We believe that if we build this product using well-established Facebook technology, we'll be able to deliver a better product. Better for you, better for everyone.

Ease of use

Anyone who has ever deployed “vendor software” can tell you that it's often a pain. We don't want you to have to run dozens of databases, a search cluster, a work queue, web servers, etc. just so that you can use osquery across your environment.

Scalability

At Facebook, we have a lot of infrastructure. When we reason about quantity, “Our Whole Fleet” means something different at Facebook than it does at most organizations. When we build osquery, we build it with the intention of being able to perform at a massive scale. We're confident that we can build a more scalable product by standing on the shoulders of the giants that we have internally.

Will I be able to use this at my company?

You might think that just because we're considering using Facebook infrastructure to build our backend, we're cutting the rest of the world out of the picture. That couldn't be further from the truth! We're building this using Facebook infrastructure so that we can offer it, without reservation, to more people, more easily.

What happens with development now?

We're going to continue to develop the code for the client (osqueryi/osqueryd) completely in the open. Nothing will change on that front. You might start seeing some Pull Requests being committed that implement certain network features, API calls, etc.

As always, osquery is an open source product. You, the public, control what it can and cannot do. You can and should continue to submit Pull Requests, file issues, reach out to us on IRC, etc.

Am I still going to be able to use my own infrastructure with osquery?

Absolutely. When we first started building osquery, a top-level goal was its ability to integrate with various internal systems. To that end, we've created numerous plug-in interfaces that allow you to manipulate its functionality in various ways. We're going to be using the same plug-in APIs to implement all of this functionality. You can use the existing open source plug-ins to tie osquery into your infrastructure, just as you can do now. You can also follow the guides (https://github.com/facebook/osquery/wiki/registering-logger-plugins) on the wiki to create your own internal plug-ins, just as you can do now.

When is this going to be ready?

This is something that we're only beginning to work on. We want to share it with all of you as soon as possible, but not before the product is robust enough to support intuitive, easy use. Expect more updates soon!

umareddy commented 9 years ago

@marpaia could you please clarify:

1. Will the backend infrastructure code be open sourced? From reading your post I get the impression that it will not be open sourced.
2. Am I reading correctly that Facebook will allow my company (with a few hundred MacBooks and servers) access to the backend for free? Will I be able to run distributed queries across my company's computers using FB infrastructure? Will the FB backend infrastructure support multi-tenancy?
3. If the backend is not open sourced, can we build our own backend using the changes FB is making in osquery to enable communication with the backend?

Thank you. It is very generous of FB to open source osquery.

marpaia commented 9 years ago

@umareddy, sure! Thanks for the great questions.

Question 1

Most of Facebook's backend development happens in a single monolithic repository. The code we write will take advantage of many Facebook-internal APIs, frameworks, data storage mechanisms, etc. Therefore, the code wouldn't be useful to the general public because there would be no way that the general public could run the code.

Technically, maintaining synchronization from internal repositories to external repositories is hard. We do it for a lot of our open source projects and it's never a good time. That's why, among a few other reasons, we develop osquery on GitHub first, and then sync it into our internal repositories.

All in all, syncing code from internal repositories would require ongoing technical commitment for little technical value, so we probably won't do it.

Question 2

This is a few questions, so I'm going to try to break it up and answer them individually.

Will this be free?

Storing data requires servers, electricity, physical space, etc. All of those things come out of a budget that we have for capacity. If the amount of capacity that we need to serve the community exceeds the capacity that we have internally allocated, we may have to figure something out. I'm not sure. It's honestly too early to say. We'll have to do some extensive capacity planning and run that against our budget restrictions to see what we can do with what we have.

In short, I'd like this to be as free and open as possible, but that may not be logistically reasonable, so we'll have to do some math and see what we can come up with.

Will I be able to use this to run distributed queries across my infrastructure?

Yes, definitely.

Will this support multi-tenancy?

We recognize that the colocation of data will be a concern for people. Right now, there are a few caveats about how our databases work that make this question a little tricky to answer generically. We may or may not colocate your data with other data.

Rest assured that, regardless of where your data is located, keeping your data (and everyone else's data) safe is our number one priority. As the general architecture of our data model matures, we should have a more concrete answer to this question.

Additionally, I think that questions about the colocation of data should be more along the lines of "why should I believe this data will be safe?" as opposed to "will other data be on the same physical server?". Whether or not data is colocated on the same server doesn't say much about the security of the data, so it's a bit of a strawman in my opinion.

Question 3

If you'd like to build your own backend, you can totally still do that. I hope that the backend not being open source isn't WHY you do that, but you'll have the technical ability to do so, absolutely. We're building this using only the APIs that are available to everyone else as well. If nothing else, this will serve as a good set of example plugins for you :)

marpaia commented 9 years ago

I'm closing this. Our plans have moved on a bit from this and the road forward looks a bit different. I'll post another update soon.

Verma-Ashish13 commented 8 years ago

@marpaia I have been assigned the task of deploying osquery on my institution's servers and computers and then highlighting its effect on security. But I don't understand why we should use osquery when we can gather this data with a few system calls. Anyway, I would be very grateful if you could shed some light on that.

marpaia commented 8 years ago

@Verma-Ashish13 I'm not sure if I understand your question. If you're inquiring as to why osquery is useful over writing something yourself which uses syscalls, then I encourage you to come hang out in our Slack channel and discuss the issue with the many users of osquery. You can create an account here: https://osquery-slack.herokuapp.com/
