provenance-io / provenance

A distributed, proof of stake blockchain designed for the financial services industry.
https://provenance.io
Apache License 2.0
89 stars 39 forks source link

Metadata telemetry not being recorded #2144

Closed SpicyLemon closed 1 month ago

SpicyLemon commented 2 months ago

Summary of Bug

None of the metadata telemetry stuff is actually being emitted.

Version

v1.19.1 (all the way back to v1.3.0 when it was "added").

Steps to Reproduce

  1. Connect a node to datadog.
  2. Try to find any telemetry info for the "metadata" key.

Note: I haven't actually tried those steps, but I'm certain telemetry isn't being recorded.

I found this by noticing that we are attempting to record telemetry using code like this:

    defer types.GetIncObjFunc(types.TLType_Scope, action)

That GetIncObjFunc function returns a func() that actually does the telemetry stuff.

When execution reaches that line, it queues up the call to types.GetIncObjFunc(types.TLType_Scope, action) to run at the end of the current function. So at the end, it calls GetIncObjFunc, which just creates a func(). But nothing tells go to actually execute that func, so it's simply discarded (similar to how defer iter.Close() just discards any error returned by iter.Close()).

If that line were this (note the extra () at the end):

    defer types.GetIncObjFunc(types.TLType_Scope, action)()

Then, when execution reaches that line, it would execute types.GetIncObjFunc(types.TLType_Scope, action), then it would queue up the resulting func() to run at the end of the current function. At the end, it'd then run the thing that updates telemetry before finishing it's return.

To verify that nothing is happening:

  1. I created a global var TelemetryCounter int in the events.go file.
  2. I updated GetIncObjFunc to increment TelemetryCounter at the end of the func() it returns (after the call(s) to telemetry.IncrCounterWithLabels).
  3. I updated some unit tests to check the TelemetryCounter value before and after executions that should be doing telemetry stuff.
  4. I wrote a simpler unit test that just did defer types.GetIncObjFunc(types.TLType_Scope, action).
  5. I found that the TelemetryCounter value was never changing.
  6. I then wrote a unit test that just did defer types.GetIncObjFunc(types.TLType_Scope, action)() and found that TelemetryCounter DID increment in that test.

The Problem

The GetIncObjFunc function returns a func() that updates telemetry data. However, the resulting func() is never being executed; it's being created then thrown away.

There's a couple ways to fix this.

  1. Add () to the end of all the invocations, e.g. defer types.GetIncObjFunc(types.TLType_Scope, action)().
  2. Refactor GetIncObjFunc to just do the telemetry updates instead of returning a function that does it; probably want to rename it too since it's now doing something instead of getting a function.

Further, I don't think we get anything by deferring these calls. In almost all cases, it's being done at the very end of the func anyway.

Further Consideration

It's probably worthwhile to have a discussion about whether this telemetry data is still wanted. I.e. should we fix this or delete that stuff?

Here's what it was supposed to track:

Labels:


For Admin Use

iramiller commented 2 months ago

After the last discussion it probably doesn't make sense to track object counts in these kinds of metrics. Any data warehouse will already have a complete picture of the type and number of assets already and will not rely on these counters. If there was value here it would likely be in processing time counters that track how long it takes to validate records--but even there we can get this data based on overall request processing time which is handled separately.