The simplified flow to understand:
Below I'll outline a consistent structure to make parsing this data easier.
It'll follow something like:
{
  "timestamp": "2024-07-05T10:30:00Z",
  "node_id": "oracle_id",
  "event_type": "request_received",
  "details": {
    "cid": "...",
    "key": "value",
    ...
  }
}
{
  "timestamp": "2024-07-05T10:30:05Z",
  "node_id": "worker_1",
  "event_type": "work_received",
  "details": {
    "cid": "...",
    "received_from": "oracle_id"
  }
}
@j2d3 please add your thoughts/hesitations here ASAP so we can dig into a resolution. I will add more specific data structures once we're straightened out and ready to proceed.
cc @teslashibe
@theMultitude this is the code that ships a JSON payload to S3: https://github.com/masa-finance/masa-oracle/pull/392/commits/563c1506c2217455da8ae3e904e9ae5de6dc0920
and this is how you would call it, where jsonPayload contains the data to send:
err = db.SendToS3(id, jsonPayload)
if err != nil {
logrus.Errorf("[-] Failed to send oracle data: %v", err)
}
From an analytics perspective, one of the most common pitfalls is realizing too late that you haven't collected the data needed for an analysis. As we work to fine-tune an economic model and stabilize the protocol through organic growth, we don't want to find ourselves in that situation. In contrast to periodic data pulls, which offer only static glimpses, event-driven analytics gives visibility into critical state changes. The following is an outline of the data streams I see as essential to analytics work at Masa within the current quarter (Q3 2024).
Node State (vertices) - How does node state change over time? Node state at any point in time is captured by the nodeData structure as it currently exists. However, understanding how node state evolves is important for understanding the make-up of our network and how nodes mature over time.
Node Relationships (edges) - How do nodes relate to one another? I want to understand which nodes interact with other nodes and how those patterns develop over time.
Work Threads - How does a request for data from the protocol propagate and come to completion?
These data streams don't need to exist immediately, but taking the time to carve out their foundations now will make refining them far easier as we move forward.
Problem Statement
We currently don't have analytical systems in place to monitor the protocol's usage and performance.
In my opinion, we have a couple of paths in the near term:
Discussion points
There are trade-offs to both of these options, and they aren't mutually exclusive, but I believe the second is more robust while requiring less work on the protocol side, and is therefore the quicker solution.
One critical pain point is that storing data at the granularity I'd be interested in would, I believe, be prohibitive if it lives in each block:
The other main point is a question of separation of concerns and unnecessary data:
The main drawbacks to implementing an event-driven data layer are:
Summary
I believe implementing an event-driven infrastructure should be a priority, as it would:
Acceptance Criteria: