Closed: bewing closed this issue 2 years ago
hi @bewing, thanks for opening this issue.
Right now, the `event-merge` processor only works with the `prometheus` output with `gnmi-cache: true`.
docs here:
```yaml
# a boolean, if set to true, the received gNMI notifications are stored in a cache.
# the prometheus metrics are generated at the time a prometheus server sends scrape request.
# this behavior allows the processors (if defined) to be run on all the generated events at once.
# this mode uses more resource compared to the default one, but offers more flexibility when it comes
# to manipulating the data to customize the returned metrics using event-processors.
gnmi-cache: false
```
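As a rough sketch, a config tying the two together could look something like this (the output/processor names and the listen address are placeholders; double-check the exact keys against the docs):
```yaml
processors:
  merge-events:
    # merge events that belong to the same subscribe response into a single event
    event-merge: {}

outputs:
  prom:
    type: prometheus
    listen: ":9804"
    # cache the received gNMI notifications; metrics and processors run at scrape time
    gnmi-cache: true
    event-processors:
      - merge-events
```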
In your case, each leaf is sent in a separate SubscribeResponse, so the responses need to be cached before a metric is written to the output.
I believe you are using a file output just for testing purposes? Do you know which output you will ultimately use?
I will make sure to document all this a bit better.
> In your case, each leaf is sent in a separate SubscribeResponse, so the responses need to be cached before a metric is written to the output.
> I believe you are using a file output just for testing purposes? Do you know which output you will ultimately use?
That is correct, I was using the file output for iterative testing (same reason I only subscribed to a single interface). I was planning on outputting to InfluxDB, so I was hoping for a generic approach (rather than one only useful for scraping) so that vendors with aggressive SubscribeResponse generation strategies are better supported. If that's not feasible, I will look at other options, so no worries.
> I will make sure to document all this a bit better.
I much appreciate all the work you have done and continue to do!
The same caching method can be used for influxDB (I will need to write it), so if you are interested I can work on it.
The difference with prometheus will be that writing to influxDB will happen at regular intervals (instead of waiting for prometheus to send a scrape request).
The write interval will have to be long enough to make sure all the leaf values arrived, i.e. write_interval > sample_interval.
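As a very rough sketch (the cache-related keys under the output are hypothetical since this mode does not exist yet; only the subscription part follows the current config format):
```yaml
subscriptions:
  if-counters:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 10s        # leaves of a given sample arrive within this window

outputs:
  influx:
    type: influxdb
    url: http://localhost:8086  # placeholder
    bucket: telemetry           # placeholder
    gnmi-cache: true            # hypothetical: cache gNMI updates before writing
    cache-write-interval: 30s   # hypothetical: must be kept > sample-interval
```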
Maybe that processing/augmentation is better done in the TSDB. Using collector-side caching will cause timestamp alteration (the original TS will be rewritten with the TS of the merge event), which might be undesirable.
This makes me think that a cool hack would be to add an optional facts-collection step before the subscriptions start, to collect certain information about the target and store it in memory for the sake of flexible augmentation of events.
For example, we collect the facts from that eos target and maintain the JSON in an `interface{}` struct. Then a processor called `event-augment` can refer to this in-mem stored structure with a jq accessor, i.e.
```yaml
processors:
  augment-desc:
    event-augment:
      key: description
      value: 'some jq expression to access description field for a target-specific in-mem stored json'
```
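To make it a bit more concrete, the facts document and the jq expression could look something like this (all of it hypothetical, including the event-augment processor itself):
```yaml
# hypothetical: facts collected from the target before subscriptions start,
# shown in YAML for readability (kept in memory as a generic structure)
facts:
  interfaces:
    Ethernet1:
      description: "uplink to spine1"

# the hypothetical event-augment processor could then resolve the description
# by indexing the facts with a value taken from the event tags:
processors:
  augment-desc:
    event-augment:
      key: description
      value: '.interfaces[$interface_name].description'  # $interface_name is a hypothetical variable bound from event tags
```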
> Maybe that processing/augmentation is better done in the TSDB. Using collector-side caching will cause timestamp alteration (the original TS will be rewritten with the TS of the merge event), which might be undesirable.
The unfortunate part here is that if you need the events merged prior to doing any tag processing, you cannot add tags to already-written readings in InfluxDB (I am not sure about Prometheus). The timestamp argument is important, however -- if you are looking for extremely accurate readings, this would cause an issue. There's also the problem of timing -- I am not sure how you could guarantee that the correct readings are all grouped together, or that none would be missed, especially with multiple targets: over time you might have a device restart its gNMI server, and there's the possibility of the windowing falling in the middle of a device emitting readings. Maybe that's something that could be addressed in a kafka streams pipeline?
> For example, we collect the facts from that eos target and maintain the JSON in an `interface{}` struct. Then a processor called `event-augment` can refer to this in-mem stored structure with a jq accessor, i.e.
> ```yaml
> processors:
>   augment-desc:
>     event-augment:
>       key: description
>       value: 'some jq expression to access description field for a target-specific in-mem stored json'
> ```
I actually used this approach in telegraf's gNMI plugin to solve this problem for Arista -- there are gNMI subscriptions we have just to maintain metadata and attach it to subsequent readings. Using ON_CHANGE subscriptions keeps the data from going stale.
The resulting merged message will have the timestamp of the most recent message, not the time at which the merge happened. If someone is using a merge I believe they don't care about accurate timestamps per leaf anymore.
Augmenting messages with data from other subscriptions is the goal behind the `event-merge` processor.
This could be done by adding a cache to the influxDB output.
Right now, influxDB points are batched and written every 10s or 1000 points (configurable).
By adding a cache, the batching would happen on the gNMI updates instead of the influxDB points.
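For reference, the current batching is configured roughly like this (key names as I recall them from the influxdb output docs, so treat the exact names as assumptions):
```yaml
outputs:
  influx:
    type: influxdb
    url: http://localhost:8086  # placeholder
    bucket: telemetry           # placeholder
    token: my-org:my-token      # placeholder
    # current behavior: points are buffered and written when either limit is reached
    batch-size: 1000            # max number of points per write
    flush-timer: 10s            # max time between writes
```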
> The resulting merged message will have the timestamp of the most recent message, not the time at which the merge happened. If someone is using a merge I believe they don't care about accurate timestamps per leaf anymore.
The primary Arista use case would be counters when using target_defined subscriptions (10s for counters, on_change for the rest of state). Caching/TS changing of events would have a non-zero impact on delta calculation, I would think (but I haven't fully thought it through).
> For example, we collect the facts from that eos target and maintain the JSON in an `interface{}` struct. Then a processor called `event-augment` can refer to this in-mem stored structure with a jq accessor, i.e.
I did rewrite the tag subscription code in telegraf with a similar approach (see influxdata/telegraf#11019). Given a subscription like:
```
/network-instances/network-instance[name=*]/protocols/protocol[identifier=BGP][name=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/state
```
And a config like:
```yaml
path: "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/description"
elements:
  - vrf
  - protocol
  - neighbor
```
It would be nice to say: if I receive an event that has a `/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/description` value key and all of the tags derived from the elements configured above, write that value to an in-memory store keyed by those tags plus the source tag. The same processor would then identify other events carrying the same set of tags and write the description as a new tag.
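Put together, the processor I have in mind might be configured something like this (entirely hypothetical, nothing like it exists in gnmic today):
```yaml
# hypothetical processor, only to illustrate the proposal above
processors:
  bgp-neighbor-description:
    event-value-tag:   # hypothetical processor name
      path: "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/description"
      elements:        # tags that identify the owning object
        - vrf
        - protocol
        - neighbor
      tag-name: description   # tag added to other events carrying the same element tags plus the source tag
```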
I'd be happy to work on this if the community thinks there is value in it. I'm not sure what the concurrency model of gnmic is, or what kind of locks would be needed on the processor datastore (or whether each target gets its own instance of a processor, etc.).
What you are describing is a subset of the caching method I described above (which is already implemented in the prometheus output).
It basically caches all the received updates in a gNMI cache which allows a processor run to be applied to all the updates. So a user can augment a data point with any value/tag from any other update message in the cache.
The difference with the suggested solution is that there is no need to define which paths and elements need to be cached.
While trying to figure out how to tag interface counters with interface descriptions, I discovered that EOS's Octa gNMI server apparently emits every leaf as a separate `gNMI.SubscribeResponse`. I tried using the documented `event-merge` processor to combine these events, but that doesn't appear to work either. Debug log below: