logstash-plugins / logstash-filter-http

HTTP Filter Plugin for Logstash
Apache License 2.0

Implement native caching for higher scale lookups #10

Open acchen97 opened 5 years ago

acchen97 commented 5 years ago

There has already been some demand for native caching of HTTP lookups in this plugin. This would help enable higher throughput without requiring use in conjunction with third-party caching systems like Memcached.

Please feel free to +1 if you are interested in this feature.

yaauie commented 5 years ago

I envision a two-part solution:

  1. Support for proxies (including https) would be trivial to add, and would allow users to configure a local caching proxy (e.g., Squid Cache) that obeyed all of the semantics and standards of the web and kept that complexity out of our maintenance domain.
  2. A naïve LRU in-memory cache (perhaps around LogStash::Filters::Http#request_http(verb, url, options)) is also possible, if a little more complex, and would spare users of this plugin from configuring and running the above-mentioned caching proxy, at the cost of breaking some of those semantics (e.g., no upstream cache invalidation) and introducing some unpredictability in the plugin's memory consumption.
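
To make the second option concrete, a built-in cache could be surfaced as a couple of filter options. A minimal sketch, assuming hypothetical cache_size / cache_ttl settings that do not exist in the plugin today (url and target_body are the filter's existing options):

filter {
  http {
    url         => "https://api.example.org/lookup?ip=%{clientip}"
    target_body => "[lookup]"
    # Hypothetical options (not implemented) sketching a naïve in-memory LRU
    # keyed on the outgoing (verb, url, options) tuple:
    # cache_size => 1000   # max entries before least-recently-used eviction
    # cache_ttl  => 300    # seconds before a cached response is discarded
  }
}
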
telune commented 4 years ago

I'll add my vote on this one; it would be ideal for our data enrichment use case. We are currently using the jdbc_streaming filter, but it's a less-than-ideal fit. The perfect fit would be the http filter with caching capabilities, just like the aforementioned jdbc_streaming, only making HTTP calls instead of SQL queries.
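
For reference, jdbc_streaming's built-in cache is controlled by a few options (use_cache, cache_size, cache_expiration); the connection details and values below are illustrative placeholders, but this is roughly the shape we'd hope to see on the http filter:

filter {
  jdbc_streaming {
    jdbc_driver_library    => "/path/to/postgresql.jar"
    jdbc_driver_class      => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://db01/enrichment"
    jdbc_user              => "logstash"
    statement              => "SELECT name FROM users WHERE ip = :ip"
    parameters             => { "ip" => "clientip" }
    target                 => "[user_info]"
    # Built-in lookup cache:
    use_cache        => true   # enabled by default
    cache_size       => 1000   # maximum number of cached lookups
    cache_expiration => 60.0   # seconds before a cached result expires
  }
}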

grownuphacker commented 4 years ago

+1 Just came to add my interest in this. So far the only approach I've gotten to work is hammering my REST source with the exact same request over and over.

vjt commented 4 years ago

-1

I don't think that LogStash should have a caching layer, as there is already external software (nginx, memcached) that does this well and is easy to integrate with LogStash.

I have two use cases for which I am using external caches: a local caching proxy [1], and memcached-based enrichment [2].

That said, I see a number of pluses in keeping the caching layer external.

Sorry for the verbosity; I hope this is useful for your use cases as well.

[1] local caching proxy

# Cache up to 1 GB on disk; entries not accessed for 24h are evicted.
proxy_cache_path /srv/cache/foobar levels=1:2 keys_zone=foobar:40m inactive=24h max_size=1g;

server {
  listen localhost:8084;

  access_log off;

  location / {
    proxy_pass            https://foobar;

    # Cache regardless of what the upstream's Cache-Control header says.
    proxy_ignore_headers  Cache-Control;

    proxy_set_header      Host foobar.example.org;
    proxy_buffering       on;
    proxy_cache           foobar;
    proxy_cache_key       $uri$is_args$args;

    # Cache 200 and 404 responses for an hour, everything else for 5 minutes.
    proxy_cache_valid     200 404 1h;
    proxy_cache_valid     any 5m;

    # Let only one request populate a given cache key at a time, and serve
    # stale entries while the upstream is erroring or being refreshed.
    proxy_cache_lock      on;
    proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;

    add_header X-Cache-Status $upstream_cache_status;
  }
}

upstream foobar {
  server foobar.example.org:443;
}
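
With a proxy like the above listening locally, the http filter just points at the local listener and the caching stays entirely outside LogStash; a minimal sketch (path and field names are placeholders):

filter {
  http {
    url         => "http://localhost:8084/lookup?ip=%{clientip}"
    target_body => "[lookup]"
  }
}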

[2] memcached enrichment

# The event carries a client-IP-to-user mapping; store it in the cache for use by future events.
#
if [clientip] and [user] and [user] !~ '(?:^(?:unauthenticated|_?system|anonymous|\[?unknown\]?)$)' {
  memcached {
    hosts => ["cache-01"]
    namespace => "logstash-ip"
    set => { "[user]" => "%{clientip}" }
    ttl => 86400 # Avoid stale lookups
  }
}

# The event carries no user; try to look one up in the cache by client IP.
#
if [clientip] and ! [user] {
  # Check the cache
  #
  memcached {
    hosts => ["cache-01"]
    namespace => "logstash-ip"
    get => { "%{clientip}" => "[user]" }
    add_tag => ["user_from_cache"]
  }
}