Closed mmguero closed 11 months ago
After looking a little closer at this, I think that's the approach I'm going to take. Since netbox is already caching things I think we're adding complication without a lot of gain by doing our own caching on top of that.
Working on it, but as of mmguero-dev/Malcolm@cab66bd it's broken (seems to be a concurrency issue). Just as a reminder to myself...
The good news is I've removed our extra layer of caching and it does seem to be more consistent now. The bad news is the CPU load increases quite a bit when, especially when doing autopopulation due to the increased API load. I need to do some benchmarks to compare, and also compare when autopopulation is turned on vs. not.
Okay, on further analysis I think actually we're going to have to come to a middle ground: still do the caching, but with two changes: 1) decrease the TTL significantly (from 600 seconds down to maybe like 30 or 60 seconds) and 2) have ALL of the caches in the file be a TTL cache, rather than having the site/role/device type just being LRU.
I think removing the caching completely will just make it too slow.
Closing as per my last comment and commit.
Some rough benchmarks (the last number is milliseconds in the filter):
With no caching, autopopulate on:
ruby_netbox_enrich_destination_ip_segment;87489;87489;3017294
ruby_netbox_enrich_source_ip_segment;96610;96610;3411570
ruby_netbox_enrich_source_ip_device;96610;96610;10968202
ruby_netbox_enrich_destination_ip_device;87489;87489;11361163
With no caching, autopopulate off:
ruby_netbox_enrich_destination_ip_segment;85755;85755;169437
ruby_netbox_enrich_source_ip_segment;94305;94305;222455
ruby_netbox_enrich_destination_ip_device;85755;85755;958852
ruby_netbox_enrich_source_ip_device;94305;94305;1607698
Although this is rough (the numbers of the events aren't exactly the same), you can see that without the caching it's like 10x+ worse.
I'm not going to hold up the v23.09.0 release for this, but this is an issue I just discovered that needs to be address:
The Logstash Ruby script in charge of enrichment from netbox uses some LRU caches to avoid frequent repetitive API lookups of items.
However, the netbox database is not static: things can be added (not so much of an issue, since a cache miss will just cause the thing to be looked up) but also deleted and changed.
So Logstash could be going along and have information in its cache that is now invalid because the underlying database has been changed.
I don't think there's a good way to "trigger" Logstash when changes are written to NetBox (externally to auto-population, like through the UI or through ./scripts/netbox-restore). But we need to consider how to handle this.
At the very least, a complete wipe/restore of the database via
./scripts/netbox-restore
needs to trigger a complete Logstash restart (or, if we want to be more subtle about it, somehow notify logstash to clear all of its caches).This needs some careful thought to figure out how to deal with it. On one hand, we know NetBox is using postgresql and redis to do its own database and caching: maybe we simply don't need to cache in LogStash at all and it won't be too expensive to just call the API every time? I don't know.