Open hhromic opened 2 years ago
👍 that is a neat idea. I think, until we have a better situation for event metadata, that we'd want to introduce another schema field for this. We'd likely want to also make it opt-in so users aren't surprised by the new field when sending to sinks that have strict schemas.
As a separate issue, that we are working on aspects of, there really shouldn't have been such a performance regression adding the remap transform there.
👍 that is a neat idea. I think, until we have a better situation for event metadata, that we'd want to introduce another schema field for this. We'd likely want to also make it opt-in so users aren't surprised by the new field when sending to sinks that have strict schemas.
That sounds like a good approach!
As a separate issue, that we are working on aspects of, there really shouldn't have been such a performance regression adding the remap transform there.
This is how I tested this. First, I created some random data for testing:
cat /dev/urandom | tr -dc 'a-z A-Z' | tr ' ' '\n' | head -n 5000000 > data.test
Then, the baseline configuration is this:
api:
  enabled: true
sources:
  logs:
    type: stdin
sinks:
  blackhole:
    type: blackhole
    inputs:
      - logs
Which I run like this with the testing data:
while true; do cat data.test; done | docker run --name vector --rm -i -v $PWD/vector.yaml:/vector.yaml:ro timberio/vector:0.20.0-distroless-libc -c /vector.yaml
On my computer, vector top reports an average of 88-90K EPS for the blackhole sink.
Now, the labelling configuration using remap is this:
api:
  enabled: true
sources:
  logs:
    type: stdin
transforms:
  label:
    type: remap
    inputs:
      - logs
    source: '.label = "SOME"'
sinks:
  blackhole:
    type: blackhole
    inputs:
      - label
On the same computer, vector top reports an average of 82-83K EPS for the blackhole sink with this configuration.
Ah, yes, sorry, I should have made it clearer in my comment that there is definitely known overhead in adding a remap transform; it is just larger than we would like for it to be. We are making improvements there though. For example rewriting VRL's interpreter as a VM.
That particular case could likely also be aided by removing one clone that happens because VRL is an expression-based language: .label = "SOME" itself returns a cloned "SOME", which in this case is discarded. @JeanMertz do you know if we have an issue tracking that? I can't find anything.
Thanks for the perf reproduction case though 😄
Ah, yes, sorry, I should have made it clearer in my comment that there is definitely known overhead in adding a remap transform; it is just larger than we would like for it to be. We are making improvements there though. For example rewriting VRL's interpreter as a VM.
Ah yes, I have been looking forward to seeing that released!
Btw, I just remembered that there is an add_fields non-VRL transform that can be used as well:
sources:
  socket-8080:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
  socket-8081:
    type: socket
    mode: tcp
    address: 0.0.0.0:8081
transforms:
  label-8080:
    type: add_fields
    inputs:
      - socket-8080
    fields:
      source_id: socket-8080
  label-8081:
    type: add_fields
    inputs:
      - socket-8081
    fields:
      source_id: socket-8081
sinks:
  console:
    type: console
    inputs:
      - label-*
    encoding: json
While the boilerplate code is pretty much the same, at least performance doesn't degrade as much as with remap :)
I just tested this using the configuration in my last comment and got 86-88K EPS using add_fields.
I do know that those transforms are getting deprecated and hence they are not documented anymore.
But for now I guess it is good enough for us until the feature requested here is implemented (if decided so).
After developing our use-case more, I actually think that a better feature request would be the ability to add a customisable "label" to each event emitted by a source. So instead of a source_id field, a configurable source_label one for example. It could be provided at the source configuration like this:
sources:
  udp-socket:
    type: socket
    mode: udp
    address: 0.0.0.0:1514
    label: DATACENTER-1
  tcp-socket:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
    label: DATACENTER-2
If the label config is not provided (which could be the default), then no source_label field is added, so the feature is opt-in.
The reason for this customisability is that we don't want to name our sources after the labels; the labels are instead placed inside the event for the next processing component.
👍 I wonder if we could just generalize this further and let you attach a VRL program to sources. It'd really just be syntax sugar over chaining a remap transform afterwards though. Granted, as you noted, we still need to improve the performance of remap over native code (it'll never be quite as fast, but can be closer).
Hehe I wouldn't "over do it" either :) As you say, it would be syntax sugar and I don't think simple things like labelling the source warrants a remap interpreter. I would keep it simple as sources should be as performant as possible imho :)
The labeling solution would help my use case too. Another option would be to always include the metadata that is already collected when an event is aborted/asserted, if we could opt in to having that metadata always put in the message schema.
"metadata": {
  "dropped": {
    "component_id": "transform_name",
    "component_kind": "transform",
    "component_type": "remap",
    "message": "function call error for \"assert\" at (29:86): custom assert message",
    "reason": "error"
  }
}
Any progress on this? I'm currently trying to add a common metric log_last_timestamp{source="$source$"} to enable further alerting on failed synchronizations between file events and destination system events (elastic).
Nothing yet. The workaround is pretty straightforward though: add a remap transform after each source you want to label.
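For readers following along, the workaround described above might look roughly like the following. This is only a sketch that mirrors the add_fields configuration from earlier in the thread, with remap swapped in; the component names and the source_id field name are illustrative, not something Vector provides by itself.

```yaml
# Sketch only: one remap transform per source, merged again at the sink.
# Component names and the source_id field are illustrative assumptions.
sources:
  socket-8080:
    type: socket
    mode: tcp
    address: 0.0.0.0:8080
  socket-8081:
    type: socket
    mode: tcp
    address: 0.0.0.0:8081
transforms:
  label-8080:
    type: remap
    inputs:
      - socket-8080
    source: '.source_id = "socket-8080"'
  label-8081:
    type: remap
    inputs:
      - socket-8081
    source: '.source_id = "socket-8081"'
sinks:
  console:
    type: console
    inputs:
      - label-*
    encoding: json
```

Every labelled source needs its own transform here, so the approach depends on knowing the sources in advance.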
I have a * mask. I grab all sources, so remap will not work for me, cause I don't know them beforehand.
Use Cases
We need to identify which configured source an event comes from, for labelling and/or routing. In our main use case, we have two socket sources configured on different listening ports and we need to add a label field indicating which port the event came from. This can extend to more than two in some other use cases.
In other use cases, we need to route events depending on which source they come from.
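As a rough illustration of the routing use case, a route transform could branch on a per-source field once one is available. The sketch below assumes events already carry a source_id field (for example, set by a labelling transform); the component names, route names, and the field itself are hypothetical.

```yaml
# Sketch only: routes events on an assumed source_id field.
# label-8080 / label-8081 are hypothetical upstream labelling transforms.
transforms:
  by-source:
    type: route
    inputs:
      - label-8080
      - label-8081
    route:
      port-8080: '.source_id == "socket-8080"'
      port-8081: '.source_id == "socket-8081"'
sinks:
  console-8080:
    type: console
    inputs:
      - by-source.port-8080
    encoding: json
```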
Attempted Solutions
At the moment, Vector does not provide any field to identify the source from which an event comes. It only provides a source_type source-related field that is not useful. For example:
However, we can add remap transforms wired to each source to label the events and then merge. For example:
Which produces the following output when sending sample data to each port:
While this approach works, it has two drawbacks:
Proposal
Instead of the approach described above, we think that the simplest solution would be to add a new source_label field to the event, similar to the existing source_type field, containing a configurable label of the source that emitted the event. For example:
See https://github.com/vectordotdev/vector/issues/11544#issuecomment-1050709550 for an implementation idea.
This would make it trivial to identify the source in remapping transforms, together with any other remappings necessary.
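To illustrate the intent, a single remap transform could then branch on the label alongside its other work. This is a sketch under the assumption that the proposed source_label field exists; it is not available in Vector 0.20.0, and the source names, label value, and region field are hypothetical.

```yaml
# Sketch only: .source_label is the field proposed in this issue and
# does not exist in Vector 0.20.0. All names and values are hypothetical.
transforms:
  enrich:
    type: remap
    inputs:
      - udp-socket
      - tcp-socket
    source: |
      # Branch on the proposed label next to any other remapping logic.
      if .source_label == "DATACENTER-1" {
        .region = "dc1"
      }
```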
References
No response
Version
vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)