influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Feature Request - "Transform" Processor plugin #2667

Closed PauloAugusto-Asos closed 6 years ago

PauloAugusto-Asos commented 7 years ago

Feature Request

Requesting a "Transform" processor plugin.

I am trying to import Web access logs into InfluxDB with Telegraf. However, some of the URL PATHs include identifiers (session IDs, product IDs, etc). Ex:
/products/cars/12345/view
/shoppingBasket/1234567890/view

The URL PATH is being shipped as a Tag Value (obviously). I need to be able to replace those identifiers in the PATH Tag Value before shipping the data to Influx (or whatever other DB), so that the paths become easily recognizable as the «same» URL PATH for searches and aggregations, and to prevent an explosion of "series" in InfluxDB or Graphite.

Proposal:

[[processors.transformer]]
  tagpass = "ApacheLog"
  tagname = "path"
  matcher = "/products/cars/(\\d+)/view/"
  matchertype = "regex" # "literal"
  replaceMatchedIndex = 1 # index 0 being the whole match. To replace *only* the ID
  replacement = "{CarID}"
  tagexclude = "ApacheLog"

[[processors.transformer]]
  tagpass = "ApacheLog"
  tagname = "path"
  matcher = "/shoppingBasket/(\\d+)/view"
  matchertype = "regex" # literal
  replaceMatchedIndex = 1
  replacement = "{SessionID}"
  tagexclude = "ApacheLog"

Simpler Proposal:

[[processors.transformer]]
  tagpass = "ApacheLog"
  tagname = "path"
  matcher = "/products/cars/\\d+/view/"
  matchertype = "regex" # "literal"
  # replaceMatchedIndex = 1
  replacement = "/products/cars/{CarID}/view/"
  tagexclude = "ApacheLog"

SimplerSimpler Proposal:

[[processors.transformer]]
  tagpass = "ApacheLog"
  tagname = "path"
  replaceDigits = 3 # replace all sequences of X+ digits
  replaceGuids = true
  replaceTrimmedGuids = true # guids stripped of dashes
  tagexclude = "ApacheLog"

danielnelson commented 7 years ago

I like it. First version seems best, I think we need captures. I suggest we change a few names:

Some things to think about:

PauloAugusto-Asos commented 7 years ago

Hi Danielsan,

[...] transform tag names, fieldkeys, and field values. How should we select these?

We could select which element (field, tag, or?) we want to "transform" with:
tagname = "myField123"
Now that I think of it, "tagname" sounds incorrect, as it hints at InfluxDB "Tags" vs "Fields"...

Which transformations should we try to fit into this processor [...]?

I focused on this particular requirement of having to transform strings, but you're right, there are many other transformation requirements that can be thought of.

You're right - if the only thing this transform plugin does is to transform strings, it should be named something more specific like "string transform" plugin.

Regarding enums, what type of "enum matching" are you thinking of? Could we do a string transform per Enum option? Each transform would try to grab a specific string and if it matches - adds/replaces/inserts the enum string?

Other requirements might be mathematical transformations, like "/60", though that doesn't feel really required (we can / should be able to do transformation math on the queries to the database).


Regarding the terminology (mind you I'm not a native English speaker):

matchertype -> transformation
matcher -> regex_pattern
literal -> replace

If we call the element "transformation" it seems to me unclear that the element is referring to how it's going to match/grab. Erm, I don't know whether we're actually thinking the same thing?

My reasoning was:

tagname = "myField123"
^ meaning what field to try to transform.

matchertype = "regex" or matchertype = "literal"
^ meaning how am I going to try to match if the transform should occur.

matcher = "regex_pattern|or|literal_string"
^ meaning this is what it's going to try to find inside "myField123".

replaceMatchedIndex = 1
^ meaning it will try to replace only regex match group no. 1, instead of replacing the whole thing (or index 0 to replace the whole thing).

replacement = "{CarID}"
^ meaning to replace whatever was matched with this.

danielnelson commented 7 years ago

Regarding enums, what type of "enum matching" are you thinking of?

An example is mapping strings to ints "green" -> 0, "yellow" -> 1, "red" -> 2.

danielnelson commented 7 years ago

I think we should probably scope this to regular expressions, we can create separate processors for enums, type conversions, math, etc.

I'm also flip-flopping on backreferences. It seems like it's not too bad to not have them, or at least I'm not able to come up with a good example to justify the extra config complexity.

We could select tags/fields using subtables. Literal replacements could still be done with regex, so we wouldn't need a type option.

If we go with that an example config could be:

[[processors.regex]]
  namepass = ["apache"]

  [[processors.regex.tags]]
    key = "path"
    pattern = '/products/cars/\d+/view/'
    replacement = "/products/cars/{id}/view/"

  [[processors.regex.fields]]
    key = "path"
    pattern = '/products/cars/\d+/view/'
    replacement = "/products/cars/{id}/view/"

PauloAugusto-Asos commented 7 years ago

mapping strings to ints "green" -> 0, "yellow" -> 1, "red" -> 2.

That could be trivial, it seems: (what type each field is, is specified separately, right?)

[[processors.transformer]]
  tagname = "myStringField"
  matcher = "^green$"
  replacement = "0"

[[processors.transformer]]
  tagname = "myStringField"
  matcher = "^yellow$"
  replacement = "1"

[[processors.transformer]]
  tagname = "myStringField"
  matcher = "^red$"
  replacement = "2"


I think we should probably scope this to regular expressions [...] Literal replacements could still be done with regex,

^ For simplicity that would probably be the best, I agree. There's nothing we can do with "literal" match-strings that we can't do with Regexes.

I'm also flip-flopping on backreferences

^ Not sure what "backreferences" are...

We could select tags/fields using subtables.

^ Why can't we just search for a «column» regardless of whether it is an Influx "Tag" or "Field"? If we can be abstracted from that, that would be ideal...

an example config could be:
pattern = '/products/cars/\d+/view/'
replacement = "/products/cars/{id}/view/"

^ Could we still have the possibility of specifying the match-group? That would allow us to replace only parts of the original string. Use case example:
# Replace IDs - sequences of 3+ digits
pattern = '.+(\d{3,}).+'
replaceMatchedIndex = 1
replacement = "{id}"

danielnelson commented 7 years ago

On the enum/case example, this might be somewhat slow and somewhat verbose but perhaps it would meet the requirements. If we stick to string replacements you might need to follow it up with a type conversion, so that you get 0i instead of "0".
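A rough Go sketch of that two-step approach, string replacement followed by a type conversion, might look like the following. The rule type, the colorRules table, and colorToInt are hypothetical names used only for illustration; reporting a failed conversion through a boolean is an assumption, since the thread leaves that behaviour open.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rule pairs a pattern with its replacement string (hypothetical type).
type rule struct {
	pattern     *regexp.Regexp
	replacement string
}

// colorRules mirrors the three string transforms from the example above.
var colorRules = []rule{
	{regexp.MustCompile(`^green$`), "0"},
	{regexp.MustCompile(`^yellow$`), "1"},
	{regexp.MustCompile(`^red$`), "2"},
}

// colorToInt applies every rule in order and then converts the resulting
// string to an integer. The second return value is false when the
// conversion fails, for example because no rule matched.
func colorToInt(value string) (int64, bool) {
	for _, r := range colorRules {
		value = r.pattern.ReplaceAllString(value, r.replacement)
	}
	n, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, c := range []string{"green", "yellow", "red", "blue"} {
		n, ok := colorToInt(c)
		fmt.Println(c, n, ok)
	}
}
```

Every value passes through every rule, which is where the "somewhat slow" concern comes from; a direct map lookup would avoid that but falls outside a pure regex processor.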

Not sure what "backreferences" are...

I was referring to capture groups and the replaceMatchedIndex here, more on that below.

Why can't we just search for a «column» regardless of whether it is an Influx "Tag" or "Field"?

It turns out you can have a tag and field with the same key: foo,value=bar value=42. It is a bad idea though, so maybe we shouldn't worry too much about it. Here is a mention of it in the docs, perhaps we should borrow this syntax? value::tag would specify the tag.
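To make the ambiguity concrete, here is a small Go sketch of a metric that carries the same key as both a tag and a field, with a selector in the value::tag style picking one of them. The metric struct, splitSelector, and lookup are hypothetical and do not reflect Telegraf's internal types; rejecting unqualified selectors is an arbitrary choice made for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// metric is a simplified stand-in for a metric with tags and fields.
type metric struct {
	tags   map[string]string
	fields map[string]interface{}
}

// splitSelector splits "key::kind" into its parts. kind is empty when
// the selector carries no "::" suffix.
func splitSelector(sel string) (key, kind string) {
	if i := strings.LastIndex(sel, "::"); i >= 0 {
		return sel[:i], sel[i+2:]
	}
	return sel, ""
}

// lookup resolves a qualified selector against the metric. Selectors
// without a "::tag" or "::field" suffix are rejected in this sketch.
func lookup(m metric, sel string) (interface{}, bool) {
	key, kind := splitSelector(sel)
	switch kind {
	case "tag":
		v, ok := m.tags[key]
		return v, ok
	case "field":
		v, ok := m.fields[key]
		return v, ok
	}
	return nil, false
}

func main() {
	// Equivalent to the line protocol example: foo,value=bar value=42
	m := metric{
		tags:   map[string]string{"value": "bar"},
		fields: map[string]interface{}{"value": 42},
	}
	fmt.Println(lookup(m, "value::tag"))
	fmt.Println(lookup(m, "value::field"))
}
```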

Could we still have the possibility of specifying the match-group?

Yeah I guess we should keep them. Perhaps backreferences in the replacement string could do the job:

pattern = '(.+)\d{3,}(.+)'
replacement = '\1{id}\2'

PauloAugusto-Asos commented 7 years ago

If we stick to string replacements you might need to follow it up with a type conversion, so that you get 0i instead of "0".

I honestly don't know this but: wouldn't we need to specify the type anyway, regardless? Meh - just thinking out loud - don't even bother answering me, you know better than me and I'm just raising the question.

backreferences in the replacement string could do the job:
pattern = '(.+)\d{3,}(.+)'
replacement = '\1{id}\2'

Oh, wow, and I thought I knew the gist of everything there was to know about Regexes... I had never heard of backreferences. Live and learn! That sounds really neat and it would do the trick, indeed - you're right.

The only considerations I have about it are that the replacement would have to also be treated as a Regex (to access the backreferences), in which case we'd need to, for example, also escape the "{":
replacement = '\1\{id\}\2'
This might become a bit confusing/tricky, maybe catch people off guard thinking the replacement was a literal replacement.

Also, I'm wondering whether you can access the backreferences caught in the matching from the replacement in whatever Regex libraries (Go STD?) Telegraf is using.

Apart from those considerations I like the idea of backreferences - really cool feature of Regexes that I was unaware of.

danielnelson commented 7 years ago

wouldn't we need to specify the type anyway

I've just been thinking about this operating on strings so far. However, it would be possible to add a type option such as type = 'float' which would attempt to apply a conversion after transforming the string. If we do this we will have to consider what happens if the conversion fails.

The replacement string wouldn't be a regex, but would use the https://golang.org/pkg/regexp/#Regexp.ReplaceAll function to expand the replacement. I pasted the wrong syntax above: it looks like the Go format would be '$1{id}$2'. You wouldn't need to escape braces, but to enter a literal $ you would use $$.
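A short Go sketch of this expansion syntax, using the standard library's Regexp.ReplaceAllString. The function names and patterns are chosen for illustration only and are not taken from Telegraf. The braced form ${1} is used because Go reads $1x as a reference to a group named "1x", so braces are the safer habit whenever text follows the group reference directly.

```go
package main

import (
	"fmt"
	"regexp"
)

// Anchored pattern (an assumption for this sketch) that captures the
// text before and after the numeric ID.
var carPath = regexp.MustCompile(`^(/products/cars/)\d+(/view)$`)

// normalizePath replaces the numeric ID with a literal "{id}" while
// keeping both captured parts. Braces in the template are literal.
func normalizePath(path string) string {
	return carPath.ReplaceAllString(path, "${1}{id}${2}")
}

var digits = regexp.MustCompile(`^(\d+)$`)

// prefixDollar shows that "$$" in the template yields a literal "$".
func prefixDollar(amount string) string {
	return digits.ReplaceAllString(amount, "$$${1}")
}

func main() {
	fmt.Println(normalizePath("/products/cars/12345/view"))
	fmt.Println(prefixDollar("42"))
}
```

Inputs that do not match the pattern are returned unchanged by ReplaceAllString, so unrelated paths pass through untouched.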

tbolon commented 7 years ago

This improvement could greatly simplify the tracking of IIS / aspnet apps, since they combine usage of IIS Site Name (text) and IIS Site Id (numeric) as tags, and manual mapping is necessary.

With such a feature, we could replace the IIS Site Id in tags with the IIS Site Name (per server) to ease the correlation of measurements.

danielnelson commented 7 years ago

@tbolon What plugin are you using to capture these stats? Can you give an example of the current and desired schema?

tbolon commented 7 years ago

Currently win_perf_counters.

Some counters are returned with an internal id. Example:

  [[inputs.win_perf_counters.object]]
    # IIS, ASP.NET Applications
    ObjectName = "ASP.NET Applications"
    Counters = ["Requests/Sec"]
    Instances = ["*"]
    Measurement = "iis_aspnet_app"

And the corresponding output:

C:\Program Files\Telegraf>telegraf.exe --config "d:\telegraf.conf" --test
* Plugin: inputs.win_perf_counters, Collection 1
> iis_aspnet,instance=*,objectname=ASP.NET,dc=azure-westeu,host=frontweb3 Requests_Current=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_1_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_14_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_15_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_3_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_4_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_5_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_6_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_7_ROOT,objectname=ASP.NET\ Applications,dc=azure-westeu,host=xxx Requests_persec=0 1502134454000000000
> iis_aspnet_app,instance=_LM_W3SVC_9_ROOT,objectname=ASP.NET\ Applications,host=xxx,dc=azure-westeu Requests_persec=0 1502134454000000000

The "instance" tag value can vary from server to server based on the order in which the websites are created, so, before sending them to influxdb, I would prefer to have a way to transform them to use a better name.

I only need a bunch of hardcoded replacements in my telegraf config: "_LM_W3SVC_9_ROOT" => "SomeWebsite", etc. These IDs will never change (unless you delete/recreate websites).
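The kind of hardcoded, per-host mapping described here could be sketched in Go roughly as follows. The siteNames table and renameInstance are hypothetical names; only the "_LM_W3SVC_9_ROOT" => "SomeWebsite" pair comes from the comment above, and the second entry is invented purely as a placeholder.

```go
package main

import "fmt"

// siteNames maps IIS instance IDs to readable site names for one host.
// Only the first entry comes from the discussion; the second is a
// made-up placeholder.
var siteNames = map[string]string{
	"_LM_W3SVC_9_ROOT":  "SomeWebsite",
	"_LM_W3SVC_14_ROOT": "AnotherWebsite",
}

// renameInstance rewrites the "instance" tag in place when a mapping
// exists and leaves unknown instances untouched.
func renameInstance(tags map[string]string) {
	if name, ok := siteNames[tags["instance"]]; ok {
		tags["instance"] = name
	}
}

func main() {
	tags := map[string]string{
		"instance":   "_LM_W3SVC_9_ROOT",
		"objectname": "ASP.NET Applications",
	}
	renameInstance(tags)
	fmt.Println(tags["instance"])
}
```

Because the table lives in each server's own configuration, the same ID can map to different site names on different hosts, which is the part that cannot be done on the dashboard side.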

I can't do such a thing on my dashboard, since the "_LM_W3SVC_9_ROOT" id can map to different sites based on the host.

Other performance counters are already using IIS Site name as instance name:

  [[inputs.win_perf_counters.object]]
    # IIS, Web Service
    ObjectName = "Web Service"
    Counters = ["Current Connections"]
    Instances = ["*"]
    Measurement = "iis_websvc"

It will give the following output (mostly redacted):

C:\Program Files\Telegraf>telegraf.exe --config "d:\telegraf.conf" --test
* Plugin: inputs.win_perf_counters, Collection 1
> iis_websvc,instance=MyWebsite1,objectname=Web\ Service,dc=azure-westeu,host=xxx Current_Connections=0 1502134786000000000
> iis_websvc,dc=azure-westeu,host=xxx,instance=MyWebsite\ 2,objectname=Web\ Service Current_Connections=0 1502134786000000000
...

I hope this helps.

fenneh commented 7 years ago

This would also be useful for the .Net Data Provider for SqlServer perf counters, as they include the PID of the .Net process connecting to SQL within the counter. Upon a service restart, you'll get a new PID and lose that link between the metrics.

MonkeyDo commented 6 years ago

I am also interested in this feature. My use case is to strip part of incoming mqtt topics with a regexp in the mqtt_consumer plugin.

danielnelson commented 6 years ago

Will be included in 1.7, thanks to @44px!

I encourage everyone to give the regex processor a shot before the release.