Enrichment table regex type support

srstrickland commented 9 months ago

A note for the community

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

I would like to be able to configure some regexes in an enrichment table and access them without having to call to_regex() on an enrichment field string every time (which Vector warns me is expensive and will affect throughput). If the data in the enrichment table could live as a regex in the first place, I wouldn't have to do that.

I have several types of rules that I want to manage based on configuration. The simplest is a set of rules for dropping incoming logs based on some regexes. I would like to define a CSV with three columns: log-group (string), field-name (string), match-regex (regex). As logs come in, I look up the rules based on the log-group of that record, and if any of the configured fields match the corresponding regex, I drop the record. We get custom requests from teams (owners of particular log groups) to drop certain records based on basic field matching, and this would be a convenient way to express those without having to modify code every time (instead it's just a block of generic code looking at rules).

Attempted Solutions

Write specific code for every log group (this becomes cumbersome to manage and error-prone).
Use enrichment tables to manage the rules, but coerce the regex string to a regex via to_regex() on every use, which is expensive.

Proposal

Support a regex type in the enrichment table schema so that each regex in a column of that type is compiled once, rather than on every use.
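
To make the proposal concrete, here is a minimal Python sketch of schema-driven loading. It only models the idea and is not Vector's implementation: the `SCHEMA` mapping, the `"regex"` type name, and the `load_table` helper are all hypothetical, and Python's `re` syntax differs in places from the regex engine Vector uses. The column names come from the use case above; the sample row is illustrative.

```python
import csv
import io
import re

# Hypothetical schema: column name -> type. "regex" is the proposed new type.
SCHEMA = {"log-group": "string", "field-name": "string", "match-regex": "regex"}


def load_table(csv_text, schema):
    """Parse CSV rows, compiling every column typed "regex" exactly once."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for column, value in raw.items():
            if schema.get(column) == "regex":
                row[column] = re.compile(value)  # compiled at load time
            else:
                row[column] = value
        rows.append(row)
    return rows


# Illustrative data only; a real table would be read from a file.
TABLE = load_table(
    "log-group,field-name,match-regex\n"
    "foo,message,.*whatever$\n",
    SCHEMA,
)
```

With this shape, every later lookup returns a row whose `match-regex` value is already a compiled pattern, so the per-event cost is only the match itself.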

References

No response

Version

vector 0.35.0 (x86_64-unknown-linux-gnu e57c0c0 2024-01-08 14:42:10.103908779)

jszwedko commented 9 months ago

👍 Thanks for opening this, @srstrickland. I agree that the schema could be updated to allow for regular expressions to be compiled when Vector starts up.

johnhtodd commented 8 months ago

I very much like this idea. Can you provide an example of how this would look in the way you envision it, in both a config file and in your code logic? I'm not sure I'm following the example. I, too, need regexp matching in enrichment files, but I was thinking of an enrichment file that looked more like this (we do DNS work, so I'm going to use DNS examples). The first line of the CSV file gives the field names, as usual.

/etc/vector/domain-match.regexps:

matchDomainRegexp,ruleIdentifier
^(www|webserver|webhost|web\d)\.,website
^(smtp|post|mail\d)\.,mailserver

Then the enrichment configuration and transform VRL would look like this:


enrichment_tables:
  domain_match:
    type: "file"
    file:
      path: "/etc/vector/domain-match.regexps"
      encoding:
        type: "regexp"
    schema:
      matchDomainRegexp: "regexp"
      ruleIdentifier: "string"

transforms:
  . . .
  .DNSRecordCategory, err = get_enrichment_table_record("domain_match", { "matchDomainRegexp": .responseData.question[0].etld.etld_plus })
  . . .

So, if .responseData.question[0].etld.etld_plus was equal to "www.example.com" then the result would be:

"DNSRecordCategory.ruleIdentifier": "website"

There are other questions, like "what if multiple rules match?" I'd suggest that the first one in the config file would win. I don't see how it would be possible to index a set of regular expression rules, and iteratively running through the whole list of rules to find ALL matches and then picking the most specific (if that can even be defined!) seems somewhat expensive and unnecessary.
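
As a rough illustration of the "first one in the file wins" behaviour, here is a short Python sketch. It is not VRL and not an existing Vector feature; `RULES` and `first_match` are hypothetical names, and the two patterns are copied from the sample CSV above.

```python
import re

# Rules in file order, copied from the sample CSV; compiled once up front.
RULES = [
    (re.compile(r"^(www|webserver|webhost|web\d)\."), "website"),
    (re.compile(r"^(smtp|post|mail\d)\."), "mailserver"),
]


def first_match(domain):
    """Return the ruleIdentifier of the first matching rule, or None."""
    for pattern, rule_identifier in RULES:
        if pattern.search(domain):
            return rule_identifier  # stop at the first hit; later rules are ignored
    return None
```

The scan is linear in the number of rules, so rule order doubles as priority and the worst case (no match) tests every pattern.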

I suspect that regular expression rules would have to be a whole new type of enrichment, since they don't have the indexing component.

srstrickland commented 8 months ago

Ah... So to clarify, all I wanted to do was return (already compiled) regex types as part of an enrichment record. So it changes nothing about how things are keyed or queried; just that one of the columns may be an already-compiled regex. Lookups are still based on simple matching.

My example would look something like this:

filter_rules.csv:

group,field,regex
foo,message,.*whatever$
foo,kubernetes.cluster,^dev-.*
bar,message,.*drop-me.*

An incoming record has group=foo, and has fields message and kubernetes.cluster. In VRL I want to get all the rules for group foo, and if any of the indicated fields match the given regex (i.e. in this case, if message ends in whatever or kubernetes.cluster starts with dev-), then I want to drop the record. No lookups are happening against the regex field; I simply want already-compiled regex types to be returned so I don't have to compile them and take a performance hit.
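
The drop logic described above can be sketched in Python as follows. This is a model of the intended behaviour under stated assumptions, not VRL: `load_rules`, `get_path`, and `should_drop` are hypothetical helpers, records are assumed to be nested dictionaries, and dotted field names such as kubernetes.cluster are assumed to address nested keys.

```python
import csv
import io
import re
from collections import defaultdict

# Contents of filter_rules.csv from the example above.
FILTER_RULES_CSV = """group,field,regex
foo,message,.*whatever$
foo,kubernetes.cluster,^dev-.*
bar,message,.*drop-me.*
"""


def load_rules(csv_text):
    """Group rules by log group, compiling each regex once at load time."""
    rules = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rules[row["group"]].append((row["field"], re.compile(row["regex"])))
    return rules


def get_path(record, dotted_path):
    """Follow a dotted path through nested dicts; return None if absent."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def should_drop(record, rules):
    """True if any rule for the record's group matches its named field."""
    for field, pattern in rules.get(record.get("group"), []):
        value = get_path(record, field)
        if isinstance(value, str) and pattern.search(value):
            return True
    return False


RULES = load_rules(FILTER_RULES_CSV)
```

The lookup key is still the plain string `group`; the regex column is only ever read back, never searched against, which is why the existing indexing would not need to change.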

It sounds like your use case is about configuring a regex as a lookup mechanism, which is a lot more complex, and potentially very expensive since every lookup may have to run against potentially all of the records in the table. I don't know of a way to index regexes in a way that would allow you to efficiently find the first match, without trying all of them (breaking when you find the first match is still O(n)). So unless some magic can be applied in the lookup phase, I don't think it makes sense for Vector to expose regex-based lookups, since under the covers it would still be a big performance hit. Probably better to let users do their own iteration, and effectively have control over whether they stop at the first match or keep going (or somewhere in between). Hiding this behind an API just gives a false sense of performance.
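
One way to picture "let users do their own iteration" is a lazy scan where the caller decides when to stop. The Python sketch below is only an illustration: `iter_matches` is a hypothetical helper, the first and third patterns are borrowed from the DNS example earlier in the thread, and the `web-prefix` rule is invented here purely to produce a second match.

```python
import re

RULES = [
    (re.compile(r"^(www|webserver|webhost|web\d)\."), "website"),
    (re.compile(r"^web"), "web-prefix"),  # hypothetical extra rule for illustration
    (re.compile(r"^(smtp|post|mail\d)\."), "mailserver"),
]


def iter_matches(value, rules):
    """Lazily yield the label of each matching rule, in table order."""
    for pattern, label in rules:
        if pattern.search(value):
            yield label


# Stop at the first match: patterns after the first hit are never evaluated.
first = next(iter_matches("webhost.example.com", RULES), None)

# Keep going: every pattern in the table is evaluated.
all_matches = list(iter_matches("webhost.example.com", RULES))
```

Either way the cost stays linear in the number of rules; the generator just makes that cost visible to the caller instead of hiding it behind a lookup call.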

jordant commented 3 months ago

+1 for this feature. It would be incredibly useful when trying to determine whether web traffic is a bot based on user agent strings (which require regex matching).

vectordotdev / vector