openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0

Prometheus scrape based metrics drop frequently #259

Closed — spolack closed this 6 months ago

spolack commented 11 months ago

Hello,

first of all, thank you for this amazing tool! It basically unlocks unlimited ways to mangle our metrics!

I'm still evaluating, but I seem to have found my first bug. The metrics at the scrape endpoint drop frequently. It feels like a timeout, but I can confirm from the API endpoint and the debug log that metrics are still received at the sample-interval of the subscription.

Config

subscriptions:
  cpu:
    paths:
      - "/components/component/cpu/utilization/state/instant"
    stream-mode: sample
    sample-interval: 15s

outputs:
  prom-output:
    type: prometheus
    listen: :8080
    path: /metrics 
    expiration: 60s 
    metric-prefix: "arista" 
    append-subscription-name: true
    override-timestamps: false
    export-timestamps: false
    strings-as-labels: false
    timeout: 10s
    debug: true
    target-template:
    num-workers: 1

api-server:
  address: :8081
  timeout: 10s
  enable-metrics: true

Metrics:

curl http://mgmt-r1-lab2-de:8080/metrics  
# HELP arista_cpu_components_component_cpu_utilization_state_instant gNMIc generated metric
# TYPE arista_cpu_components_component_cpu_utilization_state_instant untyped
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU0",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU1",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU10",source="127.0.0.1:6030",subscription_name="cpu"} 4
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU11",source="127.0.0.1:6030",subscription_name="cpu"} 4
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU12",source="127.0.0.1:6030",subscription_name="cpu"} 2
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU13",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU14",source="127.0.0.1:6030",subscription_name="cpu"} 8
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU15",source="127.0.0.1:6030",subscription_name="cpu"} 2
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU2",source="127.0.0.1:6030",subscription_name="cpu"} 2
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU3",source="127.0.0.1:6030",subscription_name="cpu"} 0
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU4",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU5",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU6",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU7",source="127.0.0.1:6030",subscription_name="cpu"} 1
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU8",source="127.0.0.1:6030",subscription_name="cpu"} 4
arista_cpu_components_component_cpu_utilization_state_instant{component_name="CPU9",source="127.0.0.1:6030",subscription_name="cpu"} 1
# Following number grows every 15s
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="127.0.0.1:6030",subscription="cpu"} 17

The following chart shows the problem: [screenshot, 2023-10-19: Prometheus time series showing the metrics dropping and reappearing]

While debugging, I suspected that the Prometheus output's internal cache might fail to receive updates. To check, I enabled export-timestamps on the output to see whether the timestamps get updated with every interval.

Instead of helping to debug the issue, this fixed it. Now the metrics are permanently present and the timestamp is updated every 15s.

Best Simon

spolack commented 11 months ago

I'm also seeing the following log output, but that is possibly unrelated? The HTTP requests never time out; the endpoint always responds quickly.

2023/10/19 21:02:10.130216 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:308: [prometheus_output:prom-output] writing expired after 10s
2023/10/19 21:02:10.130217 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:308: [prometheus_output:prom-output] writing expired after 10s
2023/10/19 21:02:10.130229 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:308: [prometheus_output:prom-output] writing expired after 10s
2023/10/19 21:02:10.130282 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:308: [prometheus_output:prom-output] writing expired after 10s
2023/10/19 21:02:10.130292 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:308: [prometheus_output:prom-output] writing expired after 10s
spolack commented 11 months ago

Reverting to Version 0.32.0 also fixes that issue.

karimra commented 11 months ago

Hi, I'm not sure I understand the issue you're seeing. The timeout is about writing the metric to the local cache. v0.32.0 didn't have a timeout; with the same symptoms there, you would see high memory usage.

spolack commented 11 months ago

The metrics simply disappear from the Prometheus scrape endpoint after ~1 minute. A few seconds later they are back.

The configuration above should be sufficient to reproduce. What I find interesting is that it simply works when enabling export-timestamps.

karimra commented 11 months ago

Are the routers and the machines where you run gNMIc and prometheus synced with an NTP server? What is the time offset between them?

If the metrics disappear after 60s, the expiration attribute might be responsible. With export-timestamps: true, gNMIc uses the timestamp received from the router to mark metrics as expired. With export-timestamps: false, gNMIc uses the timestamp at which the metric was added to the cache. In that case Prometheus uses the scrape timestamp to store the metric.
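That expiration check can be sketched roughly as follows. This is illustrative Python only, not gNMIc's actual code; the cache key and timestamps are made up. It shows how a router that keeps resending a stale timestamp makes a metric vanish once the stored timestamp falls outside the expiration window:

```python
class MetricCache:
    """Illustrative sketch: entries are dropped from the scrape output
    once their stored timestamp is older than `expiration` seconds."""

    def __init__(self, expiration):
        self.expiration = expiration
        self.entries = {}  # metric key -> stored timestamp (seconds)

    def store(self, key, ts):
        # A resent update with the same key overwrites the old entry,
        # but with an unchanged `ts` the entry still ages out.
        self.entries[key] = ts

    def collect(self, now):
        # Only entries within the expiration window are exposed.
        return [k for k, ts in self.entries.items()
                if now - ts <= self.expiration]

cache = MetricCache(60.0)  # matches the output's `expiration: 60s`
# Router keeps resending the same (stale) last-change timestamp:
cache.store("cpu_instant{CPU12}", ts=100.0)
assert cache.collect(now=130.0) == ["cpu_instant{CPU12}"]  # still fresh
assert cache.collect(now=170.0) == []  # expired: metric "disappears"
```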

Another thing to check is which timestamp the router includes in the SubscribeResponses. I know Arista routers have a knob to change the timestamp from send-time to last-change-time. If it's set to the latter, you are probably getting old timestamps for some notifications.

spolack commented 11 months ago

The machines are the routers themselves! I assume zero time offset.

I collected some logs for a single metric.

bash-4.2# ip netns exec ns-MGMT /mnt/flash/gnmic --config /mnt/flash/gnmic.yml subscribe --debug
2023/10/25 23:38:34.237274 /home/runner/work/gnmic/gnmic/app/app.go:222: [gnmic] version=0.33.0, commit=d8f931e, date=2023-10-08T17:08:01Z, gitURL=https://github.com/openconfig/gnmic, docs=https://gnmic.openconfig.net
[..]
2023/10/25 23:38:34.256913 /home/runner/go/pkg/mod/google.golang.org/grpc@v1.58.2/grpclog/logger.go:53: [gnmic] [core] [Channel #1] Channel Connectivity change to READY
2023/10/25 23:38:34.256924 /home/runner/work/gnmic/gnmic/app/gnmi_client_subscribe.go:219: [gnmic] target "localhost" gNMI client created
2023/10/25 23:38:34.256944 /home/runner/work/gnmic/gnmic/app/gnmi_client_subscribe.go:222: [gnmic] sending gNMI SubscribeRequest: subscribe='subscribe:{subscription:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} mode:SAMPLE sample_interval:15000000000}}', mode='STREAM', encoding='JSON', to localhost
2023/10/25 23:38:34.306576 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277097585598825 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:1}}}}
2023/10/25 23:38:34.306717 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:sync_response:true}
2023/10/25 23:38:34.306780 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277097585598825,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":1}}
2023/10/25 23:38:34.306866 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=1.000000,time=nil,addedAt=2023-10-25 23:38:34.30683334 +0000 UTC m=+0.103545344
2023/10/25 23:38:49.307105 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277127584156213 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:2}}}}
2023/10/25 23:38:49.307234 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277127584156213,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":2}}
2023/10/25 23:38:49.307278 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=2.000000,time=nil,addedAt=2023-10-25 23:38:49.307256663 +0000 UTC m=+15.103968667
2023/10/25 23:39:04.307129 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277137584081794 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:1}}}}
2023/10/25 23:39:04.307260 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":1}}
2023/10/25 23:39:04.307299 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=1.000000,time=nil,addedAt=2023-10-25 23:39:04.307281168 +0000 UTC m=+30.103993162
2023/10/25 23:39:19.306720 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277137584081794 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:1}}}}
2023/10/25 23:39:19.306842 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":1}}
2023/10/25 23:39:19.306893 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=1.000000,time=nil,addedAt=2023-10-25 23:39:19.306866352 +0000 UTC m=+45.103578356
2023/10/25 23:39:34.306827 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277137584081794 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:1}}}}
2023/10/25 23:39:34.306944 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":1}}
2023/10/25 23:39:34.307016 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=1.000000,time=nil,addedAt=2023-10-25 23:39:34.306983488 +0000 UTC m=+60.103695492
<=== METRICS DISAPPEAR

2023/10/25 23:39:49.307001 /home/runner/work/gnmic/gnmic/app/collector.go:70: [gnmic] target "localhost": gNMI Subscribe Response: &{SubscriptionName:cpu SubscriptionConfig:{"name":"cpu","paths":["/components/component[name=CPU12]/cpu/utilization/state/instant"],"mode":"stream","stream-mode":"sample","encoding":"json","sample-interval":15000000000} Response:update:{timestamp:1698277187583141822 update:{path:{elem:{name:"components"} elem:{name:"component" key:{key:"name" value:"CPU12"}} elem:{name:"cpu"} elem:{name:"utilization"} elem:{name:"state"} elem:{name:"instant"}} val:{uint_val:0}}}}
2023/10/25 23:39:49.307117 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277187583141822,"tags":{"component_name":"CPU12","source":"localhost","subscription-name":"cpu"},"values":{"/components/component/cpu/utilization/state/instant":0}}
2023/10/25 23:39:49.307156 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:449: [prometheus_output:prom-output] saved key=3840991386943600484, metric: name=arista_cpu_components_component_cpu_utilization_state_instant,labels=[component_name=CPU12,source=localhost,subscription_name=cpu],value=0.000000,time=nil,addedAt=2023-10-25 23:39:49.307138443 +0000 UTC m=+75.103850437
<=== METRICS REAPPEAR

^C
received signal 'interrupt'. terminating...

It seems that it's currently only updating the timestamp on change.

Still, I see several updates within those 60s:

2023/10/25 23:38:34.306780 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277097585598825,
2023/10/25 23:38:49.307234 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277127584156213,
2023/10/25 23:39:04.307260 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,
2023/10/25 23:39:19.306842 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,
2023/10/25 23:39:34.306944 /home/runner/work/gnmic/gnmic/outputs/prometheus_output/prometheus_output/prometheus_output.go:435: [prometheus_output:prom-output] got event to store: {"name":"cpu","timestamp":1698277137584081794,
spolack commented 11 months ago

I just configured the router to set the timestamp to send-time; the default seems to be last-change-time.

Now I see every update coming in with an updated timestamp. Still, after 60s the metrics disappear until the next update.

spolack commented 10 months ago

This problem still occurs with the latest release. 0.32.0 is the last working release.

[spolack@r1-lab1-de ~]$ /mnt/flash/gnmic version
version : 0.34.3
 commit : d461fa9
   date : 2023-11-14T02:13:11Z
 gitURL : https://github.com/openconfig/gnmic
   docs : https://gnmic.openconfig.net
karimra commented 10 months ago

Can you set expiration: -1s, just to make sure the metrics are missing because they are marked as expired by gNMIc?
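For reference, only the expiration field changes; a sketch with the other fields as in the original config:

```yaml
outputs:
  prom-output:
    type: prometheus
    listen: :8080
    expiration: -1s   # disable expiry, to test whether expiration hides the metrics
```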

spolack commented 10 months ago

With expiration: -1s the metrics are continuously returned, but they are not updating.

karimra commented 9 months ago

Can you make sure the new notifications coming from the router have higher timestamps than the previous ones?

spolack commented 9 months ago

Yes. I was able to confirm that before.

[quotes in full the log from the earlier comment above]
karimra commented 9 months ago

Those metrics are being resent with the same timestamp:

2023/10/25 23:39:04.307260 ....... : {"name":"cpu","timestamp":1698277137584081794,
2023/10/25 23:39:19.306842 ........: {"name":"cpu","timestamp":1698277137584081794,
2023/10/25 23:39:34.306944 .......: {"name":"cpu","timestamp":1698277137584081794,

A metric is identified by the subscription name, the path, and the labels.
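As an illustration of that identity (hypothetical key function, not gNMIc's actual hashing): because the cache key depends only on the metric name and its labels, a resent update with identical labels maps to the same key and overwrites the existing entry, consistent with the repeated `saved key=3840991386943600484` in the logs above:

```python
# Hypothetical sketch: metric identity from name + labels only.
def metric_key(name: str, labels: dict) -> int:
    return hash((name, tuple(sorted(labels.items()))))

a = metric_key("arista_cpu_utilization_instant",
               {"component_name": "CPU12", "subscription_name": "cpu"})
b = metric_key("arista_cpu_utilization_instant",
               {"subscription_name": "cpu", "component_name": "CPU12"})
assert a == b  # same identity regardless of label order
```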

karimra commented 9 months ago

If the router is not able to send those notifications with an updated timestamp, you can use the override-ts processor to make gNMIc override the timestamp on reception (ref).
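A sketch of such a processor configuration. The processor name set-rx-ts is arbitrary, and the exact event-override-ts fields should be checked against the gnmic docs:

```yaml
processors:
  set-rx-ts:
    # replace the router-provided timestamp with the receive time
    event-override-ts:
      precision: ns

outputs:
  prom-output:
    type: prometheus
    listen: :8080
    event-processors:
      - set-rx-ts
```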

michaelpsomiadis commented 9 months ago

I think I solved this myself by adding a cache: {} key under the prometheus output section, like so. Here are the docs that helped me:

outputs:
  prom-output:
    type: prometheus
    listen: :9804
    cache: {}

Once I added this, my curl showed the metrics holding stable, something like this:

while true; do curl -s gnmic:9804/metrics | wc -l; sleep 1; done
spolack commented 9 months ago

    cache: {}

I can confirm that adding a cache fixes the issue. In addition, some metrics were still expiring after 60s, but changing the timestamp to send-time fixed those.

What I'm wondering is why it worked flawlessly with 0.32.0.

karimra commented 7 months ago

@spolack I think I found what the issue is and fixed it in v0.35.1. Could you give it a try?

mdebord-dlr commented 7 months ago

I think I am having this same issue, will try v0.35.1

UPDATE: Fixed for me in 0.35.1. Ty!

karimra commented 6 months ago

Closing this, feel free to reopen if you believe it's not fixed.