kolide / fleet

A flexible control server for osquery fleets
https://kolide.com/fleet
MIT License
1.1k stars 261 forks source link

Discrepancy between hosts labeled and hosts returning with the same query #1852

Open benjamin-hering opened 6 years ago

benjamin-hering commented 6 years ago

What version of fleet are you using (fleet version --full)?

version: "2.0.0-rc2", branch: "master", revision: "c4f2850e38a611a44189adf80e433bf38046e387", go_version: "go1.10.2", build_date: "2018-06-18T19:21:18Z", build_user: "zwass"

What operating system are you using?

Docker running on CentOS 7

What did you do?

Added a label to categorize hosts off pattern matching on the hostname

What did you expect to see?

I expected to see the number of hosts tied to that label to be the same as the number of hosts that would respond to the same query run ad-hoc

What did you see instead?

I saw a much smaller number of hosts in the label.

There's a few different labels I have all with a similar pattern. As an example the label I have for datacenter dev hosts. It has 77 hosts

screen shot 2018-06-26 at 11 06 57 am

But running the label query against all hosts gives me 474 responses.

screen shot 2018-06-26 at 11 06 46 am

zwass commented 6 years ago

I'd love to dig further into this. Can you find a host that returns results for this live query but is not in the label? Once you find that host, start osqueryd with the normal options and add --tls_dump --database_path=/tmp/osqueryd_test and provide some logs of the output. We want to look for whether it receives the appropriate query, and whether it returns the results for that query.

benjamin-hering commented 6 years ago

So it took me a while to remember that I had turned down the log levels and was stumped for a while as to why tls_dump wasn't giving me much output. Once I figured out that I played myself, I spun it back up. I dropped some of the finer details of some of our specific environment queries to make it more readable, but the log is other wise unchanged.

I find it curious that some of the label queries are coming through, along with the general kolide system info gathering. For example:

I0628 12:37:41.878635 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_os_version: select * from os_version limit 1
I0628 12:37:41.881649 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_flags: select name, value from osquery_flags where name in ("distributed_interval", "config_tls_refresh", "config_refresh", "logger_tls_period")
I0628 12:37:41.888379 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_info: select * from osquery_info limit 1
I0628 12:37:41.901291 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_system_info: select * from system_info limit 1
I0628 12:37:41.905676 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_uptime: select * from uptime limit 1
I0628 12:37:41.907933 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_17: SELECT hostname AS hostname FROM system_info
WHERE hostname LIKE "uw2%";
I0628 12:37:41.910668 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_6: select 1;
I0628 12:37:41.912742 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_9: select 1 from os_version where platform = 'centos' or name like '%centos%'

kolide_label_querey_17 is definitely one of mine. Also curious what's going on exactly with kolide_label_query_6, that seems malformed.

Regardless, here's the full log excerpts.

I0628 11:44:57.502751 13496 events.cpp:825] Event publisher not enabled: syslog: Publisher disabled via configuration
I0628 11:44:57.520993 13496 events.cpp:708] Subscriber expiration is too low: file_events
I0628 11:50:15.419867 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 11:50:15.606704 13515 query.cpp:107] Storing initial results for new scheduled query: pack/Find Evil/Find Evil - Listening Ports
I0628 11:53:54.660626 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 11:53:55.151022 13515 query.cpp:107] Storing initial results for new scheduled query: pack/Find Evil/Find Evil - Running processes not on disk
I0628 11:58:31.335737 13515 scheduler.cpp:75] Executing scheduled query pack/Status & Baselines/Status - Files Changed in lcapp in the last hour: SELECT COUNT(md5) FROM file_events
-- In the past hour is unix_time minus 3600 seconds
WHERE time > ((SELECT unix_time FROM time) - 3600)
AND category="lcapp"
;
I0628 12:03:19.662750 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - ARP anomalies: SELECT address, mac, address_count
FROM
(SELECT address, mac, COUNT(mac) AS address_count FROM arp_cache WHERE NOT mac="00:00:00:00:00:00" GROUP BY mac)
WHERE address_count >1
AND ... more query details
;
I0628 12:03:19.835664 13515 query.cpp:107] Storing initial results for new scheduled query: pack/Find Evil/Find Evil - ARP anomalies
I0628 12:04:41.903179 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 12:09:58.261546 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
E0628 12:17:07.255290 13513 udev.cpp:91] udev monitor returned invalid device
I0628 12:19:07.815693 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 12:20:47.077489 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - suspicious file access: SELECT file_events.action, file_events.time, file_events.target_path, users.username
FROM file_events JOIN users ON file_events.uid = users.uid
WHERE ... more query details;
I0628 12:26:01.817616 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 12:33:34.465849 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 12:37:41.871425 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_network_interface: select ia.interface, address, mask, broadcast, point_to_point,
                               id.interface, mac, id.type, mtu, metric, ipackets, opackets,
                               ibytes, obytes, ierrors, oerrors, idrops, odrops, last_change
                        from interface_details id join interface_addresses ia
                               on ia.interface = id.interface where length(mac) > 0
                               order by (ibytes + obytes) desc
I0628 12:37:41.878635 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_os_version: select * from os_version limit 1
I0628 12:37:41.881649 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_flags: select name, value from osquery_flags where name in ("distributed_interval", "config_tls_refresh", "config_refresh", "logger_tls_period")
I0628 12:37:41.888379 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_info: select * from osquery_info limit 1
I0628 12:37:41.901291 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_system_info: select * from system_info limit 1
I0628 12:37:41.905676 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_uptime: select * from uptime limit 1
I0628 12:37:41.907933 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_17: SELECT hostname AS hostname FROM system_info
WHERE hostname LIKE "uw2%";
I0628 12:37:41.910668 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_6: select 1;
I0628 12:37:41.912742 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_9: select 1 from os_version where platform = 'centos' or name like '%centos%'
I0628 12:42:05.089785 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 12:48:00.543174 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 12:57:56.944254 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - ARP anomalies: SELECT address, mac, address_count
FROM
(SELECT address, mac, COUNT(mac) AS address_count FROM arp_cache WHERE NOT mac="00:00:00:00:00:00" GROUP BY mac)
WHERE address_count >1
AND ... more query details
;
I0628 12:58:07.976950 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 13:02:26.436856 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 13:04:06.653898 13515 scheduler.cpp:75] Executing scheduled query pack/Status & Baselines/Status - Files Changed in lcapp in the last hour: SELECT COUNT(md5) FROM file_events
-- In the past hour is unix_time minus 3600 seconds
WHERE time > ((SELECT unix_time FROM time) - 3600)
AND category="lcapp"
;
I0628 13:14:11.790992 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 13:16:53.228477 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 13:23:44.587118 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - suspicious file access: SELECT file_events.action, file_events.time, file_events.target_path, users.username
FROM file_events JOIN users ON file_events.uid = users.uid
WHERE ... more query details;
I0628 13:30:16.402235 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
I0628 13:31:20.707454 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 13:38:26.144731 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_network_interface: select ia.interface, address, mask, broadcast, point_to_point,
                               id.interface, mac, id.type, mtu, metric, ipackets, opackets,
                               ibytes, obytes, ierrors, oerrors, idrops, odrops, last_change
                        from interface_details id join interface_addresses ia
                               on ia.interface = id.interface where length(mac) > 0
                               order by (ibytes + obytes) desc
I0628 13:38:26.151037 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_os_version: select * from os_version limit 1
I0628 13:38:26.154932 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_flags: select name, value from osquery_flags where name in ("distributed_interval", "config_tls_refresh", "config_refresh", "logger_tls_period")
I0628 13:38:26.163169 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_osquery_info: select * from osquery_info limit 1
I0628 13:38:26.175634 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_system_info: select * from system_info limit 1
I0628 13:38:26.180313 13514 distributed.cpp:133] Executing distributed query: kolide_detail_query_uptime: select * from uptime limit 1
I0628 13:38:26.183069 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_17: SELECT hostname AS hostname FROM system_info
WHERE hostname LIKE "uw2%";
I0628 13:38:26.186204 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_6: select 1;
I0628 13:38:26.188434 13514 distributed.cpp:133] Executing distributed query: kolide_label_query_9: select 1 from os_version where platform = 'centos' or name like '%centos%'
I0628 13:45:46.253958 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Listening Ports: SELECT DISTINCT process.name, process.cmdline, process.path, listening.port, /*listening.address,*/ process.pid
FROM processes AS process JOIN listening_ports AS listening ON process.pid = listening.pid
WHERE
AND ... more query fine details
I0628 13:46:19.445407 13515 scheduler.cpp:75] Executing scheduled query pack/Find Evil/Find Evil - Running processes not on disk: SELECT * FROM processes
WHERE NOT on_disk=1 AND ... more query fine details
;
^C
zwass commented 6 years ago

So the label query isn't even hitting the host. Is that correct?

kolide_label_query_6 is as expected... It's the "All Hosts" label that should match on any host.

Do you have a "platform" set for any of the label queries? Can you show the results of running select * from labels where label_type != 1; in your Fleet MySQL database (or at least the results for the problematic label and the one with ID 17)?

benjamin-hering commented 6 years ago

Okay, interesting.

Looks like for the uw2 label query that is showing up on the host I didn't set platform, but everything else (which doesn't show up on this test host) I set to CentOS. All of the hosts that we're playing with right now are CentOS, so the overlap betweeen "All Hosts" and "CentOS" should be 100%.

I'll drop the platform modifiers from the labels, and see if that is a viable workaround.

11  2018-05-24 17:36:26 2018-05-24 17:36:26 NULL    0   Cinderella Instances        SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "uw2-adhoc-instance-demo-default-nprd%";   centos  0
12  2018-06-19 16:53:24 2018-06-19 16:53:24 NULL    0   Datacenter Dev Hosts        SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "sjc%-dev%";   centos  0
13  2018-06-19 16:53:57 2018-06-19 16:53:57 NULL    0   Datacenter Demo Hosts       SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "sjc%-demo%";  centos  0
14  2018-06-19 16:54:34 2018-06-19 16:54:34 NULL    0   Datacenter Stage Hosts      SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "sjc%stage%";  centos  0
15  2018-06-19 18:51:05 2018-06-19 18:51:05 NULL    0   Datacenter SJC DR Hosts     SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "sjc%-prod%";  centos  0
16  2018-06-19 18:51:27 2018-06-19 18:51:27 NULL    0   Datacenter LAS hosts        SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "las%-prod%";  centos  0
17  2018-06-21 23:44:18 2018-06-21 23:44:18 NULL    0   US-West-2 AWS Hosts     SELECT hostname AS hostname FROM system_info↵WHERE hostname LIKE "uw2%";        0
benjamin-hering commented 6 years ago

Hmm, I made the edits in the GUI, and double-checking the edits showed that for whatever reason the platform column was still set to centos. I ended up having to clear it via direct query to the database for it to show back as a label for "All Platforms" in the GUI.

benjamin-hering commented 6 years ago

Update, removing the platform qualifier appears to fixed the label queries, and the numbers are now matching the numbers I get from raw queries (and what I'd expect to be seeing).

At this point, my personal needs are satisfied, but it does still feel like there's a bug to squash somewhere if label queries for CentOS hosts aren't hitting all of my CentOS hosts. Is there additional information that would be helpful to investigate?

zwass commented 6 years ago

@benjamin-hering Thanks for digging further into this! It certainly sounds like a there is a bug in the platform handling for label queries (and also in the UI for setting the platform). I'll have to see if I can repro this on my own.

It would help to see a database row from a host that does execute the query, and one from a host that doesn't. In particular, I'd be interesting in what the platform column is set to on each of them.

benjamin-hering commented 6 years ago

@zwass ah! Insightful troubleshooting step

This dev host wasn't picked up

host_name   platform    os_version
sjcbhering01-dev.tlcinternal.com    rhel    CentOS 6.9.0

This host was

host_name   platform    os_version
sjcabp-api-server01-dev.c1.tlcinternal.com  centos  CentOS Linux 7.0.0

Doing a bit more poking at the database, it appears that all of my CentOS 6 hosts show up as platform "rhel", while all my CentOS 7 hosts show up at "centos" Curious then why they all appear in the GUI in places (correctly) as CentOS. Example:

screen shot 2018-07-02 at 10 20 01 am

Perhaps some places are based of platform and others off os_version? Then I guess part of the question then is what's causing the platform to report CentOS 6 hosts as Red Hat. Is that populated by fleet or the underlying osquery data?

zwass commented 6 years ago

The platform stored in MySQL is populated by select platform from os_version in osquery. So I'd be pretty sure those hosts return rhel for that.

The "CentOS Linux" label you see in the dashboard is populated by the label query select 1 from os_version where platform = 'centos' or name like '%centos%' targeted to all platforms. Perhaps this is because at some point someone noticed an issue like you're reporting?

Maybe we need to change the platform selector for labels to be "All", "macOS", "Windows", or "Linux" and then switch it off the build_platform reported in select build_platform from osquery_info.

benjamin-hering commented 6 years ago

Sounds good. At this point, I'm happy to close out the issue. Both the root cause and a workaround for my use case have been found. Do you want to keep this issue open to track changes in Fleet around this CentOS 6 / 7 platform reporting?

nikhilkhairnar commented 5 years ago

el queries are coming through, along with the general kolide system info

Hi, can you please tell me the path where the above logs are stored in ubuntu