n00badmin / mirage

Mirage is a Cacti plugin designed to mirror SNMP polling data to file.
MIT License

Is this plugin able to work with over 500K data sources? #10

Open ravinallappan opened 6 years ago

ravinallappan commented 6 years ago

We are currently evaluating Cacti and this plugin for our environment, which will scale up to over 500K data sources.

Provided the Cacti server hardware sizing is sufficient, can we expect this plugin to work in this environment without problems? Could the fact that it writes to a single log file have a negative impact on functionality or performance?

Are there any gotchas we should be aware of? Your feedback is much appreciated and will save us a lot of time in the long run. Thanks.

netniV commented 6 years ago

This plugin replicates every piece of data that is recorded in Cacti over to the Splunk database. Any log file should be placed into a rotation, either via cron or via the plugin's own code. For example, Cacti itself moved from relying on the system to rotate its logs to using its own rotation code.
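
For what it's worth, a minimal logrotate sketch for the Mirage output file might look like the below. The path is an assumption for illustration, not the plugin's documented default, so check your Cacti/Mirage settings for the actual file.

```
# Hypothetical logrotate stanza for the Mirage output file.
# /var/log/cacti/mirage.log is an assumed path, not a documented default.
/var/log/cacti/mirage.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # keep rotating even while the poller holds the file open
    copytruncate
}
```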

ravinallappan commented 6 years ago

Noted on file size and rotation. Thanks.

With that, should 500K data sources be manageable? How much CPU/memory overhead should we expect from using this plugin?

netniV commented 6 years ago

That I cannot answer, as I am just a lowly user. I did have this running whilst I was testing Splunk, but I could not get my graphs the way I wanted in there, so I have abandoned that project for now in favour of other more urgent ones. I only had 100-200 data sources though, so my impact was probably minimal.

n00badmin commented 6 years ago

The plugin is currently running in at least a few major production environments. The one I am familiar with consists of 3 Cacti poller instances with upwards of 70K data sources each, on very modest hardware. No known issues.

In general, based on Cacti forum numbers and advice, a Cacti instance should be good up to ~100K data sources per poller before you start to get into rare air. I have seen claims of multiple hundreds of thousands on a single poller, but I am not sure how many you will want to stick into a single poller.

Are you planning to take advantage of the remote polling features and spread the polling over multiple instances?

I will see if some of the Splunk community users can chime in on their per poller numbers.

I don't see dumping the poller cache to disk in a single file being an issue; this is what Spine does best, dumping its output to a temp table which is then flushed into RRDs. If it can handle that, then putting the same data down to disk in a flat file should be all good.
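
On the Splunk side, picking up that flat file is just a standard file monitor. As a sketch, the path, sourcetype, and index below are placeholders rather than values defined by the plugin:

```
# Hypothetical inputs.conf stanza on the Cacti host's forwarder.
# The monitored path, sourcetype, and index are placeholders; use whatever
# Mirage actually writes and whatever index your Splunk admin provisioned.
[monitor:///var/log/cacti/mirage.log]
sourcetype = cacti:mirage
index = cacti_metrics
disabled = false
```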

netniV commented 6 years ago

Just an FYI: I know there is a lot of work going into improving multiple remote pollers for 1.2, which I believe is not far off either.

n00badmin commented 6 years ago

@netniV I have opened an issue to add Splunk Metric Store support. It will make Splunk graph creation and the overall experience much easier.

Glad to help you get over any hurdles you hit once you have bandwidth to look at it again.

netniV commented 6 years ago

That's good. I think that was someone else's issue (#11) though, not this one ;-)

I did try to apply to that Slack channel you'd suggested previously, but never got a response. I'll have to get another Splunk trial going to start testing again. I think my issue was #9 (unless I'm not understanding the whole metric store issue).

n00badmin commented 6 years ago

Ah, weird, I'll check on the approvals process for you!

I was referencing the metric store because it will vastly simplify getting the graphs working, and it is relevant here because as you scale up, you'll want the performance the metric store brings over having to use an event-based index.
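
To give a rough idea of why (the index, metric, and field names below are made up for the example and are not anything the plugin defines): with the metric store you pull values straight out with mstats,

```
| mstats avg(_value) AS traffic_in
    WHERE index=cacti_metrics AND metric_name="interface.traffic_in"
    BY host span=5m
```

whereas with an event-based index every search has to chew through the raw log lines first, along the lines of

```
index=cacti_events sourcetype=cacti:mirage oid_name="traffic_in"
| timechart span=5m avg(value) BY host
```

and at hundreds of thousands of data sources that difference adds up.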

netniV commented 6 years ago

OK Cool. Looks like I need to look into that too :)

apharas commented 6 years ago

We are running three production Cacti instances, all sending data to Splunk via the Mirage plugin. Individual Cacti graph counts are about 150k, with one of the three pollers doing 90k by itself. Spine is the poller, of course, and we keep the total poll time, even on the largest instance, at an average of ~120 seconds for 5-minute polling periods. We have been deployed for around 2 years, and we had previously been getting Cacti data into Splunk by turning on debug mode to log all results. Mirage is a huge improvement over that method, but it goes to show that as long as the data is available, it can be graphed.

What I will say, though, is that graph count is not the only thing to be aware of, since the number of items per graph is also important. We graph a number of additional items such as errors and discards, all on the same data source/graph in Cacti for ease. This brings the number of unique OIDs being graphed to ~450k. We also do not use the Cacti web interface except to admin the system; no users have access, and we don't even bother to put any graphs in the device tree view.

As far as license usage goes on Splunk, we are able to graph all of that data, as well as the Cacti application logs, using ~5 GB of data a day. We create 30-minute summary data as well as hourly, 4-hour, daily, etc., just like an RRD would, but we don't throw away the actual data in case we need it in the future for more accuracy. An added bonus of having the data with timestamps is that we are able to fill in data even if a poll is missing, or work around a slightly slow device, by using the difference in time between successful polls instead of assuming that everything happens at exactly 300-second intervals.
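
As a rough sketch of that approach (the field names here are invented for the example, they are not Mirage's actual field names, and it ignores counter wraps), something like this computes a per-second rate from the real gap between polls instead of assuming 300 seconds:

```
index=cacti_metrics sourcetype=cacti:mirage oid_name="ifHCInOctets"
| sort 0 host, interface, _time
| streamstats current=f window=1 last(_time) AS prev_time last(value) AS prev_value BY host, interface
| eval rate_per_sec = (value - prev_value) / (_time - prev_time)
| timechart span=5m avg(rate_per_sec) BY host
```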

I guess after that long-winded post I should have just said, "Yeah, if you have the correct number of pollers for Cacti, Splunk will handle the data." But we don't use the remote poller feature, as we were deployed with Cacti well before that was a normal config option.

ravinallappan commented 6 years ago

Thanks @netniV, @n00badmin @apharas.

From your inputs so far, I gather it's advisable to keep one Cacti poller to ~100K data sources.

Above that point, it's best to move to remote polling or completely separate Cacti instances.

When I was looking at remote pollers with Mirage, I came across this discussion.

https://forums.cacti.net/viewtopic.php?f=21&t=talks

When the requester mentioned backfill, was he referring to the scenario where the main poller comes back up after being offline for a while and the remote poller backfills the data? Do those details get into the Mirage log? And a more general question: does Mirage work well in an environment with remote pollers?

n00badmin commented 6 years ago

Hey, just a note on file rotation: we built the option into the plugin, so it should already be there. Let me know if there are any issues.

Small update: I am prepping tests with distributed pollers. I found a Docker container for Cacti, so I am going to play with that.

I have no reason to believe Mirage won't work as expected in remote poller setups. I will test this theory.

As for backfill, I am not certain; the link you provided is broken.

Early tests of the updated plugin have gone really well, and we have eliminated the need for lookups by moving the event enrichment to the collection layer. We are now just gathering that info as part of putting the data down to file with Mirage. This allows for much easier integration into Splunk and will let us get some awesome features as Splunk matures the metric store.