Add missing events to make it possible to better analyze network activity

MonsieurNicolas commented 4 months ago

https://github.com/stellar/stellar-core/pull/3847 took a first pass at this for transaction level metrics.

Goal: make it possible to run queries in BigQuery (based on the meta) that surface the information needed to support decisions when changing network settings, or to quantify the impact of certain protocol changes.

Design principle: reduce coupling between systems by avoiding duplicating complex logic between core and downstream systems (like complex formulas, or subtle protocol behavior).

In particular, we want to understand:

- how resources are used in general
- resource fees
- inclusion fees, and possibly why inclusion fees would be higher

At the ledger level:

- values of network settings should be accessible in downstream systems. Derived values, while they can technically be recomputed, should be emitted:
  - write_fee_per_1kb (fairly complicated, and the formula may change over time) -- related to #4244
- consider emitting information on a per tx set component basis
  - we may be able to emit information that cannot be trivially computed outside of core (surge pricing related, for example)
- overall: do we have the right information to reason about ledger-wide resource utilization vs limits?
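
To make the last question concrete, here is a minimal sketch of the kind of downstream computation that becomes trivial once both the per-ledger usage and the corresponding limits are emitted. The record shape, the resource names, and the numbers are all made up for illustration; none of them come from core's actual meta format.

```python
from dataclasses import dataclass


@dataclass
class LedgerResources:
    # Hypothetical per-ledger record; field and resource names are illustrative only.
    used: dict[str, int]    # resource type -> amount consumed in the ledger
    limits: dict[str, int]  # resource type -> ledger-wide limit (network setting)


def utilization(record: LedgerResources) -> dict[str, float]:
    """Return used/limit per resource type, skipping non-positive limits."""
    return {
        name: record.used.get(name, 0) / limit
        for name, limit in record.limits.items()
        if limit > 0
    }


# Made-up numbers, purely to show the shape of the result.
ledger = LedgerResources(
    used={"instructions": 250, "read_bytes": 40, "write_bytes": 10},
    limits={"instructions": 1000, "read_bytes": 100, "write_bytes": 50},
)
ratios = utilization(ledger)
```

The point of the sketch is that nothing here needs to replicate protocol logic: downstream only divides two emitted values.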

At the transaction level:

- total resource utilization per resource type (that corresponds to the aggregate from contract invocations + whatever is charged in C++)
  - we could instead opt to emit only the missing parts, like the computed resource fee
- per contract invocation resource utilization. This is needed to answer questions such as "overhead caused by a specific WASM, or a specific instance". We may already have this.
- overall: do we have the right information to reason about per tx resource utilization vs limits?
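
As a rough sketch of the first item, assuming downstream receives per-invocation usage plus a separate record of what is charged in C++, the per-transaction total is a plain sum per resource type. The function name, the dictionary shape, and the values below are hypothetical and not taken from the actual events.

```python
from collections import Counter


def total_tx_resources(invocations, core_charged):
    """Sum usage per resource type across contract invocations and core charges.

    invocations: list of {resource type: amount} dicts, one per contract invocation.
    core_charged: {resource type: amount} dict for what is charged outside the host.
    """
    total = Counter()
    for usage in invocations:
        total.update(usage)  # Counter.update adds amounts key by key
    total.update(core_charged)
    return dict(total)


# Made-up values for illustration only.
tx_total = total_tx_resources(
    invocations=[{"instructions": 100, "read_bytes": 20}, {"instructions": 50}],
    core_charged={"instructions": 5, "write_bytes": 8},
)
```

If core emitted the total directly, downstream would not need to know which parts are charged where, which is the coupling the design principle above tries to avoid.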

MonsieurNicolas commented 4 months ago

@sydneynotthecity tagging you here

MonsieurNicolas commented 4 months ago

Also related to this: it may be worth standardizing (SEPs?) certain types of diagnostic events, so that their structure and semantics stay reasonably stable over time and downstream processors can consume them. Note that diagnostic events are generally aggregated/parsed downstream, so we want to be able to provide better/more accurate events over time without being tied to specific traces. Imagine a diagnostic event that gives information when entering/leaving some "scope" (the stable concept): the runtime could then emit events for scopes as coarse as a "VM" or as granular as a method/host function.
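
A minimal sketch of how a downstream processor might consume such scope events, assuming each event carries a kind ("enter" or "leave"), a scope name, and a monotonically increasing counter (for example, cumulative instructions). This event shape is an assumption for illustration, not an existing diagnostic event format.

```python
def aggregate_scopes(events):
    """Return the inclusive cost per scope name from enter/leave events.

    events: iterable of (kind, scope, counter) tuples in emission order.
    Raises ValueError if the events are not properly nested.
    """
    stack = []   # currently open scopes as (scope name, counter at entry)
    totals = {}  # scope name -> summed (leave counter - enter counter)
    for kind, scope, counter in events:
        if kind == "enter":
            stack.append((scope, counter))
        elif kind == "leave":
            if not stack or stack[-1][0] != scope:
                raise ValueError(f"unbalanced leave for scope {scope!r}")
            _, start = stack.pop()
            totals[scope] = totals.get(scope, 0) + (counter - start)
        else:
            raise ValueError(f"unknown event kind {kind!r}")
    if stack:
        raise ValueError("unclosed scopes at end of trace")
    return totals


# Hypothetical traces: the same aggregator handles both granularities.
coarse = aggregate_scopes([("enter", "vm", 0), ("leave", "vm", 100)])
granular = aggregate_scopes([
    ("enter", "vm", 0),
    ("enter", "host_fn:put", 10),
    ("leave", "host_fn:put", 25),
    ("enter", "host_fn:get", 30),
    ("leave", "host_fn:get", 40),
    ("leave", "vm", 100),
])
```

Because the processor only depends on the enter/leave contract, the runtime can add finer-grained scopes later without breaking it; the "vm" total is the same in both traces.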

sydneynotthecity commented 4 months ago

➕ Yes, I agree that diagnostic events should be standardized where applicable. Otherwise the code to parse and aggregate events will be brittle and the maintenance burden will be high.

Add missing events to make it possible to better analyze network activity #4245