robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

Indices' size too big (?) #44

Closed · nickcn closed this issue 6 years ago

nickcn commented 6 years ago

Hello,

We have been using your program at the company I work for over the past couple of days, gathering NetFlow data from our core router. The thing is, I see the data growing at a rate of ~1GB/hour, which, having never worked with Elasticsearch before, seems like a lot to me. Is this behavior normal? I can provide all sorts of logs and data if needed.

Thank you

robcowart commented 6 years ago

The size of the indices will really depend on the volume of flows being ingested. Flow data is notorious for producing huge volumes of data. In addition, all of the enrichment adds to the size of each record that is stored.

  1. Which version of Elasticsearch are you using? (v6.x stores data more efficiently than v5.x)
  2. How many flows are you receiving per day?
  3. Are you using a multi-node cluster? (by default ElastiFlow will create 1 replica which doubles the volume of data, but gives you redundancy)

Run this query against Elasticsearch... curl -XGET -u USER http://ES_HOST:9200/elastiflow-*/_stats?pretty

Near the top of the results you will see something like this...

"_all" : {
  "primaries" : {
    "docs" : {
      "count" : 2032971,
      "deleted" : 0
    },
    "store" : {
      "size_in_bytes" : 2485253389
    },

This example is actually from the more advanced solution I offer, and you can see that 2M flows are consuming 2.5GB. This is 1.25KB per document (i.e. per flow record). ElastiFlow will probably be a little smaller than that depending on the details of the flow records your devices are sending.
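If you want to turn those numbers into an average size per flow record directly, a quick one-liner like this should do it (just a sketch, assuming jq is installed; USER and ES_HOST are placeholders as in the query above)...

curl -s -u USER "http://ES_HOST:9200/elastiflow-*/_stats" \
  | jq '._all.primaries.store.size_in_bytes / ._all.primaries.docs.count'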

Scrolling down a little further you should see something like this...

"total" : {
  "docs" : {
    "count" : 4065941,
    "deleted" : 0
  },
  "store" : {
    "size_in_bytes" : 4430786342
  },

Here you see the values essentially double because of the additional replica that is also stored. You will only see this difference if using a multi-node cluster.

ElastiFlow normalizes the various flow types (Netflow, sFlow and IPFIX) to a common model under the flow object. However, the original raw data is also kept. Since all of the dashboards are based on the flow object, the original (and duplicate) flow record data could be deleted, to save storage space, without any loss of dashboard functionality. This could be added easily as a configurable feature.
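To illustrate what such a feature boils down to, removing the raw objects is essentially a Logstash mutate filter along these lines (just a sketch of the idea, not the exact change in ElastiFlow)...

filter {
  mutate {
    # Drop the raw decoded flow records; the normalized "flow" object is untouched.
    remove_field => [ "netflow", "ipfix", "sflow" ]
  }
}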

That said, network flow data is always going to involve really large data volumes. I was recently talking with a Solutions Architect from Splunk, who admitted to me that almost no one uses Splunk for network flows because of the massive data volumes produced, which gets really expensive with their volume-based pricing model. ElastiFlow and the Elastic Stack offer a scalable alternative to such commercial solutions. However, you still have to decide how much data you want to keep and for how long.

robcowart commented 6 years ago

I added support to optionally remove the original flow record fields if storage space is a concern. It is available in release 2.1.0 which you can find here... https://github.com/robcowart/elastiflow/releases/tag/v2.1.0

Added the option to remove fields from the original flow records to save storage space. This is done by setting the environment variable ELASTIFLOW_KEEP_ORIG_DATA to false (default is true). The result of setting this to false is that the netflow, ipfix and sflow objects will be removed prior to sending the data to Elasticsearch. This has no adverse effect on the provided dashboards, as they are populated from the normalized flow object. However, the original flow fields will no longer be available if they are desired for additional analytics.
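For anyone running the Logstash/ElastiFlow pipeline in Docker, the variable is passed like any other container environment variable (a sketch; the image name is a placeholder, only ELASTIFLOW_KEEP_ORIG_DATA itself comes from the release notes above)...

# docker run
docker run -e ELASTIFLOW_KEEP_ORIG_DATA=false <your-elastiflow-logstash-image>

# or in a docker-compose service definition
services:
  logstash:
    image: <your-elastiflow-logstash-image>
    environment:
      - ELASTIFLOW_KEEP_ORIG_DATA=false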

nickcn commented 6 years ago

Thanks for adding support for this. One final question though: will I have to delete the old data manually, or will setting the variable take care of everything?

robcowart commented 6 years ago

Your last comment came in as I was typing...

No. Don't touch the data files on disk at all! That is a sure way to corrupt all of your indices!

When I spoke of removing the raw fields, I meant removing them at the end of Logstash's processing of the data as it is collected. That is the option I have now given you in the 2.1.0 release I mention above.

There isn't much you can do to the data you already have without reindexing it, which is a much more complicated topic. If you want to simply delete the older data, you would do that via the Elasticsearch REST API. ElastiFlow writes daily indices, so you can delete a day's worth of data by simply deleting the index, for example...

curl -XDELETE http://ES_HOST_IP:9200/elastiflow-2018.02.11

That said, there is no reason to delete the data unless you absolutely must reclaim the disk space. If you want to automate deleting old data, I recommend you look into Curator for Elasticsearch.
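As an illustration, a Curator action file that deletes elastiflow daily indices older than a certain age might look roughly like this (a sketch; the 30-day retention is just an example value, not a recommendation from this project)...

actions:
  1:
    action: delete_indices
    description: Delete elastiflow-* indices older than 30 days
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: elastiflow-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30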

If/when you move to 6.x you should see a bit of improvement in storage requirements, as improved handling of sparse doc values and the removal of the _all field both contribute to less space being used. Your current storage requirement per record is only 750 bytes (which isn't bad), and that can get smaller if you choose not to keep the original fields.

hexicans commented 4 years ago

Hello, how do we set this option (ELASTIFLOW_KEEP_ORIG_DATA) if we use Docker?