idaholab / Malcolm

Malcolm is a powerful, easily deployable network traffic analysis tool suite for full packet capture artifacts (PCAP files), Zeek logs and Suricata alerts.
https://idaholab.github.io/Malcolm/

address issues with NetBox database and Logstash's NetBox cache #259

Closed · mmguero closed this 11 months ago

mmguero commented 1 year ago

I'm not going to hold up the v23.09.0 release for this, but it's an issue I just discovered that needs to be addressed:

The Logstash Ruby script responsible for NetBox enrichment uses several LRU caches to avoid repeated API lookups of the same items.

However, the NetBox database is not static: items can be added (less of an issue, since a cache miss just causes a fresh lookup), but they can also be deleted or changed.

So Logstash can end up holding cached information that is no longer valid because the underlying database has changed.
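To make the staleness problem concrete, here's a minimal sketch (not the actual enrichment script) of a Logstash Ruby filter memoizing NetBox lookups in a plain LRU cache; the lru_redux gem and the `netbox_lookup_ip` helper are assumptions for illustration:

```ruby
require 'lru_redux'

def register(params)
  # plain LRU cache: entries are evicted only when the cache is full, never by age
  @ip_cache = LruRedux::ThreadSafeCache.new(10_000)
end

def filter(event)
  ip = event.get('[source][ip]')
  return [event] if ip.nil?
  # on a hit, the stored value is returned even if the corresponding NetBox
  # record was deleted or edited after it was first looked up
  device = @ip_cache.getset(ip) { netbox_lookup_ip(ip) } # hypothetical helper
  event.set('[source][device][name]', device) unless device.nil?
  [event]
end
```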

I don't think there's a good way to "trigger" Logstash when changes are written to NetBox outside of auto-population (e.g., through the UI or through ./scripts/netbox-restore), but we need to consider how to handle this.

At the very least, a complete wipe/restore of the database via ./scripts/netbox-restore needs to trigger a complete Logstash restart (or, if we want to be more subtle about it, somehow notify Logstash to clear all of its caches).

This needs some careful thought. On one hand, NetBox already uses PostgreSQL and Redis for its own database and caching: maybe we simply don't need to cache in Logstash at all, and calling the API every time won't be too expensive? I don't know.
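For reference, a rough sketch of what "no Logstash-side cache" could look like: hit the NetBox REST API on every event and let NetBox's own PostgreSQL/Redis layers absorb the load. The URL, token handling, and field paths here are assumptions for illustration, not Malcolm's actual configuration:

```ruby
require 'json'
require 'net/http'
require 'uri'

def register(params)
  # both values are illustrative; the real script is wired up differently
  @netbox_url   = params.fetch('netbox_url', 'http://netbox:8080/api')
  @netbox_token = params['netbox_token']
end

def filter(event)
  ip = event.get('[source][ip]')
  return [event] if ip.nil?
  # one API round trip per event, no memoization at all
  uri = URI("#{@netbox_url}/ipam/ip-addresses/?address=#{ip}")
  req = Net::HTTP::Get.new(uri)
  req['Authorization'] = "Token #{@netbox_token}"
  resp = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
  if resp.is_a?(Net::HTTPSuccess)
    result = JSON.parse(resp.body)['results']&.first
    name = result&.dig('assigned_object', 'device', 'name')
    event.set('[source][device][name]', name) unless name.nil?
  end
  [event]
end
```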

mmguero commented 11 months ago

After looking a little closer at this, I think that's the approach I'm going to take: since NetBox is already caching things, doing our own caching on top of that adds complication without much gain.

mmguero commented 11 months ago

Working on it, but as of mmguero-dev/Malcolm@cab66bd it's broken (seems to be a concurrency issue). Just as a reminder to myself...

mmguero commented 11 months ago

The good news is that I've removed our extra layer of caching and the results do seem more consistent now. The bad news is that CPU load increases quite a bit, especially during autopopulation, due to the increased API load. I need to do some benchmarks to compare, and also to compare with autopopulation turned on vs. off.

mmguero commented 11 months ago

Okay, on further analysis I think we're going to have to settle on a middle ground: keep the caching, but with two changes: 1) decrease the TTL significantly (from 600 seconds down to maybe 30 or 60 seconds), and 2) make ALL of the caches in the file TTL caches, rather than leaving the site/role/device-type caches as plain LRU.

I think removing the caching completely will just make it too slow.
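A minimal sketch of that middle ground, again assuming the lru_redux gem: every cache becomes a TTL cache with a short expiry, so even stale entries age out quickly. The cache size and names are illustrative, not taken from Malcolm's configuration:

```ruby
require 'lru_redux'

CACHE_SIZE = 10_000  # illustrative size
CACHE_TTL  = 30      # seconds, down from the original 600

def register(params)
  # every cache is now a TTL cache, including site/role/device-type, so a
  # stale entry can be served for at most CACHE_TTL seconds before expiring
  @ip_cache          = LruRedux::TTL::ThreadSafeCache.new(CACHE_SIZE, CACHE_TTL)
  @site_cache        = LruRedux::TTL::ThreadSafeCache.new(CACHE_SIZE, CACHE_TTL)
  @role_cache        = LruRedux::TTL::ThreadSafeCache.new(CACHE_SIZE, CACHE_TTL)
  @device_type_cache = LruRedux::TTL::ThreadSafeCache.new(CACHE_SIZE, CACHE_TTL)
end
```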

mmguero commented 11 months ago

Closing as per my last comment and commit.

Some rough benchmarks (the last number is the time spent in the filter, in milliseconds):

With no caching, autopopulate on:

ruby_netbox_enrich_destination_ip_segment;87489;87489;3017294
ruby_netbox_enrich_source_ip_segment;96610;96610;3411570
ruby_netbox_enrich_source_ip_device;96610;96610;10968202
ruby_netbox_enrich_destination_ip_device;87489;87489;11361163

With no caching, autopopulate off:

ruby_netbox_enrich_destination_ip_segment;85755;85755;169437
ruby_netbox_enrich_source_ip_segment;94305;94305;222455
ruby_netbox_enrich_destination_ip_device;85755;85755;958852
ruby_netbox_enrich_source_ip_device;94305;94305;1607698

Although this is rough (the event counts aren't exactly the same), you can see that without caching it's roughly 10x or more worse.