logstash-plugins / logstash-filter-dns

Apache License 2.0
7 stars 28 forks source link

allow concurrent queries when cache enabled #42

Closed robcowart closed 6 years ago

robcowart commented 6 years ago

When caching features are enabled, queries are forwarded to the name server syncronously. The result being that slow returning queries slow the processing of all data.

To evaluate the effect of these changes I extracted 5 million IP addresses from the logs of a large firewall, which ensures that the mix of IPs is similar to what would be seen in the real-world. I then repeated these 5 million IPs to have a file with 10 million addresses. The data was then processed locally on my Macbook Pro, with a DNS server on the local LAN which forwards the necessary queries to 1.1.1.1 and 8.8.8.8.

To establish a baseline these 10M IPs were passed to Elasticsearch through the following filters:

filter {
  mutate {
    rename => { "message" => "ipaddr" }
    remove_field => [ "path", "host", "@version" ]
  }
  mutate {
    add_field => { "hostname" => "%{ipaddr}" }
  }
}

The result was approx 41000-42000 eps.

screen shot 2018-05-05 at 17 48 24

Next a dns filter was added without caching.

filter {
  mutate {
    rename => { "message" => "ipaddr" }
    remove_field => [ "path", "host", "@version" ]
  }
  mutate {
    add_field => { "hostname" => "%{ipaddr}" }
  }
  dns {
    reverse => [ "hostname" ]
    action => "replace"
    nameserver => "192.168.1.1"
  }
}

Knowing this would be slower I switched to using only the first 100K entries. The results plummet to ~450 eps.

screen shot 2018-05-05 at 17 52 24

Next caching is enabled...

  dns {
    reverse => [ "hostname" ]
    action => "replace"
    nameserver => "192.168.1.1"
    hit_cache_size => 8192
    hit_cache_ttl => 900
    failed_cache_size => 2048
    failed_cache_ttl => 900
  }

The results are much more inconsistent as queries that can not be answered by downstream servers block other queries until a timeout occurs.

screen shot 2018-05-05 at 18 17 59

NOTE: You will notice that things improve in the later part of the test. This is thanks to #41 which was added in the last release. The latter part of the data has a lot of IPs that would timeout, and you can see here that #41 now caching these has a positive impact.

I then applied this PR and the improvement with the first 100K was obvious. The peak was over 2000 eps.

screen shot 2018-05-05 at 18 56 34

I then switched back to the 10M record file. Remember, this is the same 5M records twice.

screen shot 2018-05-05 at 19 28 54

Here you can see how the performance continues to improve as the cache fills. However I was a bit disappointed in the results. From the two spikes, one as high as 40,000 eps, you can see the performance promises to be very good if the query result is already cached. While 10-15,000 eps is a huge improvement over the 450-500 eps we were seeing, when the 5M records are repeated in the second half of the file, I would have expected better results.

I suspected the results were being ejected from the cache as it filled, so I increased the size of the cache and retested:

  dns {
    reverse => [ "hostname" ]
    action => "replace"
    nameserver => "192.168.1.1"
    hit_cache_size => 131072
    hit_cache_ttl => 900
    failed_cache_size => 131072
    failed_cache_ttl => 900
  }

With plenty of cache for all results in this real-world data sample, the advantage of the results being in the cache is clear.

screen shot 2018-05-05 at 19 46 06

The original data without any DNS filter was 41,000 eps. With the DNS filter and a warmed up cache, 40,000 eps was possible. This indicates that the additional overhead is minimal.

With this PR I finally feel comfortable that I can use the DNS filter in production deployments without sacrificing ingest performance. This is a huge win for producing user-friendly data!

robcowart commented 6 years ago

I wanted to add that this has been running at one of my customers for 4 days without issue. Event rates peak as high as 2500/sec with 3-5 name lookups per event, and DNS has not been a bottleneck.

It would be good to get this merged, as name resolution is a high-value feature for many logging and network flow related use-cases.

robcowart commented 6 years ago

Updated coding style based on feedback. Tested change on a live system prior to commit, and all still works as expected.