Open MadsRC opened 1 year ago
Hi @MadsRC,
I love this idea! I use this kind of feature a lot with Logstash, and I'm quite sad it's currently not possible with Vector. Your VRL example only mentions a "getLookup()" function to query Redis (or other stores) and enrich events. But it would also be nice to be able to push data into the cache from events, wouldn't it? With something like "addLookup(.userId, .username)", for example.
From a technical perspective, Redis has good support for multi-threaded clients and async operations, so it looks quite compatible with Vector and the remap task. But remap tasks are supposed to be stateless. In that case, the state of the connection to the backend cache must be kept somewhere. It cannot live in the remap transform itself, so the connection to the remote cache must be managed somewhere else, globally, which may raise the issue of sharing this connection object across all remap tasks. Might be challenging.
So maybe creating a new stateful transform would be more viable than creating a new VRL function for this case? A "remote_enrich" transform that could support getting/setting key/values in various remote network locations.
@peacand - You are right, it would be nice to be able to push data - It was supposed to be part of the issue, but unfortunately it slipped my mind. I've added it now.
I wouldn't be opposed to having it as a separate transform. That may even make it easier to implement the various methods/functions of the remote network location, as you could make a transform per implementation (i.e., one for Redis, one for Memcached, etc.).
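To make the "one implementation per remote store" idea a bit more concrete, here is a rough Python sketch of how a shared get/set interface could look. Everything in it is hypothetical: LookupBackend, InMemoryBackend and the remote_enrich function are illustration-only names, not Vector APIs, and the in-memory backend merely stands in for a real Redis or Memcached client.

```python
from typing import Optional, Protocol


class LookupBackend(Protocol):
    """Hypothetical interface each remote store (Redis, Memcached, ...) would implement."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryBackend:
    """Stand-in backend so the sketch runs without a network; not a real client."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def remote_enrich(event: dict, backend: LookupBackend) -> dict:
    """Hypothetical transform step: store userId -> username, or fill it in when missing."""
    user_id = event.get("userId")
    if user_id is None:
        return event
    if "username" in event:
        # The event already carries the mapping, so push it to the store.
        backend.set(user_id, event["username"])
        return event
    username = backend.get(user_id)
    if username is not None:
        return {**event, "username": username}
    return event
```

The transform logic only depends on the interface, so swapping the store would mean swapping the backend object rather than rewriting the transform.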
We have thought about adding these back-ends to Vector's enrichment_table feature, though we do need to figure out how best to model it so that it's clear that significant I/O latency could be introduced.
@MadsRC is there a specific back-end you are most interested in? It sounds like Redis? We have a separate issue tracking SQL support already: https://github.com/vectordotdev/vector/issues/17181
I would say Redis/Memcached caches are designed and optimized for very fast access and low-latency responses, much more so than SQL. I personally prefer Redis over Memcached. As for I/O latency, I don't know about Memcached or SQL, but Redis has good support for async operations, which may add latency to the complete processing of an event but should not block the Vector pipeline.
@jszwedko thank you for your great work on Vector ;)
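As a rough illustration of the async point above, the following Python sketch runs several lookups concurrently. It assumes nothing about Vector internals: fake_remote_get is a made-up stand-in for an async cache client, and the sleep only simulates network latency.

```python
import asyncio


async def fake_remote_get(key: str) -> str:
    """Stand-in for an async cache client call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"user-{key}"


async def enrich(event: dict) -> dict:
    # Awaiting yields control to the event loop, so other events keep moving.
    username = await fake_remote_get(event["userId"])
    return {**event, "username": username}


async def process_batch(events: list) -> list:
    # All lookups are in flight at once; results come back in input order.
    return await asyncio.gather(*(enrich(e) for e in events))


enriched = asyncio.run(process_batch([{"userId": "1"}, {"userId": "2"}]))
```

Each event still waits for its own lookup, but the waits overlap instead of queuing behind one another.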
My preferred backend would be Redis. Another potential backend would be a generic HTTP backend (via GET and POST) that allowed for integration with in-house systems - but that's mostly a nice-to-have ;)
Reading this again, this does feel like a bit of a different use-case than enrichment tables serve. I was going to roll it up into a general issue to add remote back-ends to enrichment tables, but will leave this open as a separate issue to allow arbitrary key/value setting/fetching from Vector.
As a workaround, users can fall back to using a lua transform. Lua seems to have clients for Redis and Memcached.
Having Redis lookups as an enrichment source would be really beneficial for me, as we currently use CSV enrichments heavily across multiple servers. Keeping the CSVs up to date on all nodes can be a pain! It would also be nice to look up a value and cache it locally for a TTL to remove a lot of the latency of the external call. That framework could also be extended to the DNS lookup logic :)
Thanks for a great product.
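The "look up and cache locally for a TTL" idea from the comment above could look roughly like the Python sketch below. TTLCache is a hypothetical helper, not something Vector provides, and fetch stands in for whatever remote call (Redis, DNS, ...) would actually be made.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Caches remote lookup results locally for ttl_seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no remote call
        value = self._fetch(key)  # miss or expired: go to the remote store
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a cache like this, only the first lookup of a key within each TTL window pays the network round trip, at the cost of serving values that may be up to one TTL out of date.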
Have we decided on a model or approach to solve this? And what would the scope of the feature be?
I'd love to help get this added if contributions are accepted...
(I'm looking for a Cassandra backend.)
Use Cases
Log Enrichment
Vector source produces a log:
A VRL remap transform enriches the log, using the userID, with data from an external system:
Calculating user logins
Vector source produces a log:
A VRL remap transform increments a counter in an external system, using the userID as the key, and gets the new total. An if statement then checks the new total (the total logins over a period of time) against a threshold and determines whether to produce a new message to a destination (using the routing functionality of Vector) to notify that something bad is happening.
Attempted Solutions
Data enrichment can currently be done by hard-coding the enrichment data into VRL. While this is arguably faster than making several network calls to get the data, it is not very scalable or dynamic.
Proposal
I would like to see VRL, and by extension Vector, support looking up values, and potentially setting values, in a remote system, such as Redis, Memcached or maybe a Relational Database of sorts.
On top of allowing for data enrichment, this would also allow one to use VRL/Vector as a proper detection engine. While one can already use Vector/VRL for simple detections, having the ability to reference a remote state of sorts would allow for some cool event correlation use-cases.
It is pretty common to have some sort of pipeline in front of a large, expensive enterprise SIEM system like Elasticsearch or Splunk. If one used Vector in this pipeline, and Vector supported getting and setting values in remote systems, one could offload some of the costs of these enterprise systems by doing real-time detections while one is processing the data anyway.
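To illustrate the kind of real-time detection described here, below is a small Python sketch of a counter-plus-threshold check. It is only a sketch under assumptions: CounterStore is an in-memory stand-in for a remote store, its incr method mimics the increment-and-return behaviour of Redis's INCR command, and detect and threshold are hypothetical names. Expiring counters after a period of time is left out.

```python
from collections import defaultdict


class CounterStore:
    """In-memory stand-in for a remote counter store; incr mimics Redis INCR semantics."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def incr(self, key: str) -> int:
        # Like Redis INCR: a missing key starts at 0, then add 1 and return the new value.
        self._counts[key] += 1
        return self._counts[key]


def detect(event: dict, store: CounterStore, threshold: int) -> bool:
    """Return True when the per-user count exceeds the (assumed) threshold."""
    total = store.incr(event["userID"])
    return total > threshold
```

An event for which detect returns True is the one that would be routed to an alerting destination; all other events continue through the pipeline unchanged.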
I am imagining a VRL syntax like this:
or
and then setting connection info for the getLookup and setLookup functions in the global settings.
Alternatively, supporting specific clients could also be in scope, so that one could use some of the more specialised functions of the lookup store, such as Redis's INCR function.
An example of where Redis's INCR function would be helpful is in the use-case of tracking the number of failed logins:
The above example would use the incrLookup function to increment a counter in Redis by 1, where the key is the value of the src field. The function would then return the new number (or alternatively, one could call getLookup or similar on the value), which is stored in failCount. The value of failCount is then used in a route transform to determine if it should forward the event somewhere.
References
No response
Version
No response