masa-finance / masa-bittensor

Masa Bittensor Subnet - Decentralized, Fair AI
https://masa.ai

spike: Score Rewards for Miner Responses #11

Closed Luka-Loncar closed 4 months ago

Luka-Loncar commented 5 months ago

In order to get correct rewards for the miner's work, we need to define a reward logic. The reward logic has to account for different factors, such as the ones defined by these other two spikes:

This card is a spike to try to find a strategy that works, taking into account the above.

Timebox: 1 day

Acceptance criteria

juanmanso commented 5 months ago

Overall functionality is implemented in #7; however, the full structure of the response needs to be tested, not just a single field as it is now in the PR.

hide-on-bush-x commented 5 months ago

There are no accuracy, timeliness, or other relevant metrics being considered in the current implementation, so this ticket must stay in Ready.

mudler commented 4 months ago

blocked by https://github.com/masa-finance/masa-bittensor/issues/74 as we need to assert the validity of the date to have a good score

Luka-Loncar commented 4 months ago

We are going to get follow up tasks from @obasilakis for #74 - then we can see what is going to happen.

grantdfoster commented 4 months ago

The intention of this document is to collect and summarize relevant information from previous tickets, and to provide any additional research necessary to determine a holistic approach to implementation.

Possible Variables

Variables in bold are suggested for implementation. For data accuracy, we currently leverage a parser_object (as seen here) or a parser_method. It is suggested to use the Pydantic framework for more granular validation and analysis; a minimal model sketch follows the list below.

  1. Data Accuracy (Pydantic)
    • Completeness (all fields)
    • Consistency (data format)
    • Validity (adheres to schemas and constraints, i.e. types and ranges)
    • Uniqueness (no duplicate records; ensure IDs are unique)
    • Integrity (relationships between IDs and relevant entities)
    • Relevance (data contains query or intended context)
    • Security (does not contain any sensitive information)
  2. Timeliness
    • Response time (milliseconds or timeout)
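
As a rough sketch of how Pydantic could encode some of the checks above (the field names and constraints here are assumptions for illustration, not the actual masa/types schema):

from pydantic import BaseModel, Field

# Hypothetical schema for illustration only; the real models would live in masa/types.
class TwitterProfile(BaseModel):
    id: int                              # validity: must parse as an integer
    username: str = Field(min_length=1)  # completeness: non-empty string
    followers: int = Field(ge=0)         # validity: non-negative count
    created_at: str                      # consistency: date format checked separately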

Scoring Precision

Currently, as mentioned in #74, we score in a binary fashion. With potential scoring attributes such as timeliness and completeness, it is possible to score across a range, i.e. 0-1, not 0 or 1, and to use weights for given attributes (perhaps we value accuracy over response time, for instance).

It is possible to score (or punish) based on the types and number of errors in a validation flow, e.g.:

from pydantic import ValidationError

# TwitterProfile is a Pydantic model (see the sketch above);
# external_data is the raw miner response being validated.
try:
    TwitterProfile(**external_data)
except ValidationError as e:
    print(e.errors())
    """
    [
        {
            'type': 'int_parsing',
            'loc': ('id',),
            'msg': 'Input should be a valid integer, unable to parse string as an integer',
            'url': 'https://errors.pydantic.dev/2/v/int_parsing',
        },
        {
            'type': 'missing',
            'loc': ('followers',),
            'msg': 'Field required',
            'url': 'https://errors.pydantic.dev/2/v/missing',
        },
    ]
    """

Suggestions & Considerations

Based on the above research, an initial implementation could incorporate a response cutoff time (timeout) along with granular scoring for data accuracy. This way we value data completeness over speed, but nonetheless require miners to respond within a given timeframe, or timeout threshold. The idea was passed around that validators would fetch data themselves, cache it, and compare the miners' work against it. This approach seems backwards and rigid, and would result in false negatives if validators have a stale cache.

Implementation

Reward logic and scoring should take place in the reward() functions, defined within each validator data type. For example, to update reward logic for twitter profile responses, logic would be added to masa/validator/twitter/profile/reward.py. Similarly, for the implementation of relevant Pydantic models, updates can be made to masa/types.

One possible approach to scoring is to assume a score of 1 for a perfect response (i.e. no Pydantic errors, a response returned within the timeout threshold, and the relevance check passing if applicable). Then, for each Pydantic error, we decrement the score by a certain value, optionally weighting the punishment by error type for the initial implementation.
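
A minimal sketch of that approach, assuming the hypothetical TwitterProfile model from the earlier sketch and placeholder penalty values (none of these constants come from the codebase):

from pydantic import ValidationError

# Placeholder per-error-type penalties; real values would need to be tuned.
ERROR_PENALTIES = {"missing": 0.2, "int_parsing": 0.1}
DEFAULT_PENALTY = 0.1

def score_response(external_data: dict, process_time: float, timeout: float) -> float:
    """Start from a perfect score of 1.0 and decrement for each validation error."""
    if process_time > timeout:
        return 0.0  # Missed the timeout threshold entirely: no reward.
    score = 1.0
    try:
        TwitterProfile(**external_data)  # Hypothetical model, assumed in scope.
    except ValidationError as e:
        for error in e.errors():
            score -= ERROR_PENALTIES.get(error["type"], DEFAULT_PENALTY)
    return max(score, 0.0)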

Follow Up Tasks

Luka-Loncar commented 4 months ago

Thanks @grantdfoster great job with this spike.

@mudler @obasilakis @teslashibe @hide-on-bush-x @juanmanso what do you guys think about the proposed solution? Should we proceed with it, or does anyone have other suggestions?

mudler commented 4 months ago

It sounds good overall as a first approach - but I think scoring for Pydantic errors is actually a "language" nuance rather than a protocol reward: it mostly "punishes" those who aren't strictly following the schema rather than actually rewarding for fairness, probably because the code being run could be modified or somehow faulty.

That is to say - if the language were statically typed, answers not following the schema would lead to responses that couldn't be unmarshalled into a specific type, and thus there would be no punishment, as it would result in a client error (which we could punish as well, agreed). So, scoring by checking the validity of responses is essentially the same as scoring whether the client returned valid data or not (just because it could return anything in principle).

I'd say it would be at least more fair if we included a way to reward faster responses over slow ones. Exceeding the timeout could incur a bigger penalty, while faster responses could increase the reward by a certain factor (directly proportional to speed).

hide-on-bush-x commented 4 months ago

I'm down for following that implementation, but as @mudler mentioned, maybe it's even easier to start with time instead of data validation. My first thought was: "What if the data changes and someone has the latest version of something and sends a new, correct schema?" That miner would be wrongly punished.

Aside from that, I would also bring in some kind of random validation, where every x requests we send a request to the protocol ourselves and double-check the miners' responses against it (this could lead to blacklisting nodes, etc.). That could cover the data validation.
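
A rough sketch of that spot-check idea, assuming a hypothetical fetch_from_protocol() helper and a placeholder sampling rate:

import random

SPOT_CHECK_RATE = 0.1  # Placeholder: spot-check roughly 1 in 10 responses.

def maybe_spot_check(query: str, miner_response: dict):
    """Occasionally re-fetch data from the protocol and compare it to the miner's response.

    Returns True/False when a check was performed, None otherwise.
    """
    if random.random() >= SPOT_CHECK_RATE:
        return None
    reference = fetch_from_protocol(query)  # Hypothetical helper: validator queries the protocol directly.
    return miner_response == reference      # A real check would compare field-by-field, with tolerance.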

grantdfoster commented 4 months ago

@hide-on-bush-x I like this solution a lot, it narrows the scope for the first implementation and strikes a balance between randomness and accuracy. Some comments and follow up tasks below.

Data Validation

Our parser_object and parser_function arguments currently "validate" the data by parsing fields... would we not want to remove or at least update this type of logic, so we are dealing with raw and/or non-sanitized data from the miners?

One case I'm thinking of is if a miner sends additional fields in their payload, we want to detect that and score poorly if indeed that request is randomly chosen for data validation. Similarly, from an API perspective, I'd think we want to return what the validator judges as the "best" response, or highest scoring, not a list of sanitized responses based on pre-defined types. I can make a task for this and we can discuss there.
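
For the extra-fields case specifically, Pydantic can surface unexpected keys as validation errors when the model forbids them; a small sketch (the schema and field names are assumptions):

from pydantic import BaseModel, ConfigDict, ValidationError

class StrictTwitterProfile(BaseModel):
    # extra="forbid" turns unexpected keys into 'extra_forbidden' validation errors.
    model_config = ConfigDict(extra="forbid")
    id: int
    followers: int

try:
    StrictTwitterProfile(id=1, followers=10, injected_field="unexpected")
except ValidationError as e:
    print([err["type"] for err in e.errors()])  # ['extra_forbidden']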

Response Time

We are able to access the response time of a miner via synapse.dendrite.process_time as shown in the mocks. It is also possible to simply leverage the timeout for now if we plan to score speed in a binary fashion.

Tactic noted here for getting process_time from the dendrite object
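
A sketch of both options, reading the miner's latency from synapse.dendrite.process_time as mentioned above (the timeout value and scaling are assumptions):

def binary_speed_score(process_time: float, timeout: float) -> float:
    """Binary: full credit within the timeout, none otherwise."""
    return 1.0 if process_time <= timeout else 0.0

def proportional_speed_score(process_time: float, timeout: float) -> float:
    """Proportional: faster responses earn more, dropping to zero at the timeout."""
    if process_time >= timeout:
        return 0.0
    return 1.0 - (process_time / timeout)

# Usage, e.g.: proportional_speed_score(synapse.dendrite.process_time, timeout=12.0)
# (the 12.0 second timeout here is a placeholder)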

Follow Up Tasks

teslashibe commented 4 months ago

@grantdfoster @hide-on-bush-x @mudler I think it would be interesting to explore the size of the response in bytes which is a heuristic we are using in the protocol. I think we need to boil down parameters and weight them to get a utility score.

For example: To consolidate and weigh the parameters for calculating a utility score based on the response from miners, we can consider the following parameters:

  1. Valid Fields (Vf): This parameter measures the completeness and validity of the fields in the response. A higher number of valid fields would indicate a more complete response, contributing positively to the utility score.

  2. Response Time/Process Time (Tr): This parameter measures how quickly a miner can respond to a request. A shorter response time indicates a more efficient miner, which should be weighted positively in the utility score.

  3. Size of Response in Bytes (Sb): This parameter measures the amount of data returned by the miner. Depending on the context, a larger response size could either indicate a more detailed and useful response or unnecessary verbosity. The impact on the utility score could be positive or negative based on the specific requirements of the request.

To calculate a utility score (U), we could consider a weighted sum of these parameters. The weights assigned to each parameter (w1, w2, w3) would depend on their relative importance to the overall utility of the response. The utility score could be represented as:

$$ U = w_1 \cdot V_f + w_2 \cdot \frac{1}{T_r} + w_3 \cdot S_b $$

Where:

  • U is the utility score
  • Vf is the number (or fraction) of valid fields in the response
  • Tr is the response/process time
  • Sb is the size of the response in bytes
  • w1, w2, w3 are the weights assigned to each parameter

The specific values for w1, w2, and w3 would need to be determined based on the desired emphasis on completeness, efficiency, and detail of the responses.

This formula provides a starting point for calculating a utility score based on the response characteristics. Further refinement may be necessary based on empirical data and specific requirements of the protocol.

Can someone boil down a consolidated list of parameters we are going to weight? @grantdfoster maybe you can summarize.

teslashibe commented 4 months ago
[Screenshot: 2024-06-27 at 12:06:12 PM]
teslashibe commented 4 months ago

Also tagging @theMultitude here for visibility

mudler commented 4 months ago

> @grantdfoster @hide-on-bush-x @mudler I think it would be interesting to explore the size of the response in bytes which is a heuristic we are using in the protocol. I think we need to boil down parameters and weight them to get a utility score.
>
> For example: To consolidate and weigh the parameters for calculating a utility score based on the response from miners, we can consider the following parameters:

> 1. **Valid Fields (Vf):** This parameter measures the completeness and validity of the fields in the response. A higher number of valid fields would indicate a more complete response, contributing positively to the utility score.

I'd really keep this for punishment rather than rewarding, and where possible it should have a very low impact. I'm saying this because field validity is very intrinsic to the language being used here (Python allows virtually any kind of data back from the wire), and I don't think it would make sense to reward based on that - ideally most clients will follow the protocol - so we want to just punish those who don't (for instance, an old client version, or someone maliciously modifying the sources and running their own nodes).

> 2. **Response Time/Process Time (Tr):** This parameter measures how quickly a miner can respond to a request. A shorter response time indicates a more efficient miner, which should be weighted positively in the utility score.

:100: agreed here - I think it's in line with what we have discussed above already.

> 3. **Size of Response in Bytes (Sb):** This parameter measures the amount of data returned by the miner. Depending on the context, a larger response size could either indicate a more detailed and useful response or unnecessary verbosity. The impact on the utility score could be positive or negative based on the specific requirements of the request.

I'm not sure what value this actually brings. I have mixed feelings here and I'd say "less is more" at this stage, so I'd avoid it. I don't want this to bite us: as you mention, it really depends on the context, so evaluating the size might be a red herring in many cases (for instance, an LLM might hallucinate and give a big answer which is useless, or summarize a big article into a big chunk, in which case it is helpful).


Rewarding based on response time is still the best approach IMHO - because it really depends on:

Those are the key points of reward, IMHO - we need to provide a fair reward for those who contribute more computational power to the network.

IMHO, any metric or unit that lets us measure and reward "better" nodes, in terms of computational resources and how those resources are used, is the best approach.

Example: in a scenario where users provide GPU computational power for LLMs, a user wants to get more rewards if the unit they are sharing is beefy - otherwise there is no incentive, as it would be even more expensive (e.g. electricity has a cost which is not negligible, especially when talking about GPUs).

Anything related to the type of the request is going to be problematic because there is no quantitative way to measure the quality of the data that is in transit.

mudler commented 4 months ago

Notes for the long-term solution:

Even better would be the ability to "sign" and know the computational power served by a certain node. We have to make sure, however, that malicious actors can't exploit that to get higher utilities assigned to them.

Maybe we can find a way to determine the hardware and network quality of a given node, countersigned by the validator or a trusted party: for instance, a numeric score for network capacity and a score for the computational capacity of the node, calculated during startup and acknowledged by another party. We can likely use TPM chip capabilities here to guarantee that the software hasn't been changed and that the machine reports its computational capacity to the network correctly.

Luka-Loncar commented 4 months ago

I want to add my 2 cents based on the technical whitepaper and the additional information provided by all of you above (I'm also supported here by sonnet 3.5). Here's a suggested solution for scoring and rewarding miners and validators:

Implementation steps:

This approach aligns with the technical whitepaper's emphasis on performance-sensitive staking and utility-based rewards while addressing the current limitations in the reward logic. It provides a more nuanced and fair reward system that incentivizes accuracy, speed, and overall quality of data provided by miners and validated by validators.

theMultitude commented 4 months ago

Here is some context about how we're thinking about initial scoring on the protocol side. Within you'll find links to other relevant tickets.

Luka-Loncar commented 4 months ago

Thanks @theMultitude! Really solid work here!

@grantdfoster - can you familiarize yourself with protocol side of rewards please? Since you have been owning the spike for this on the subnet side. We would like to get the final output from you so we can proceed with implementation.

grantdfoster commented 4 months ago

Summary

General consensus is that we should mimic the approach the protocol team took to calculating a utility score, utilizing a list of parameters and associated weights. It is important to include at least (2) parameters so the initial implementation incorporates weights, and a sense of balance (i.e. between response time and data quality). This comment serves as a summary of the above conversation, with updates to associated follow-up tasks.

Notes on proposed parameter(s) and weights:

Valid Fields (Vf, w1): Should be punished for missing fields, not necessarily rewarded for having correct ones. The score can be decremented by a certain value for each missing field. Possible expansion into utilizing client version, and punishing nodes running stale versions of the protocol (future case). Example weight: 0.2

Response Time/Process Time (Tr, w2): This parameter measures how quickly a miner can respond to a request. A shorter response time indicates a more efficient miner, which should be weighted positively in the utility score. Efficiency is usually important, but too high a weight might disadvantage slower but high-quality responses. Example weight: 0.8

Size of Response in Bytes: This parameter measures the amount of data returned by the miner. Depending on the context, a larger response size could either indicate a more detailed and useful response or unnecessary verbosity. The impact on the utility score could be positive or negative based on the specific requirements of the request.

Staked Masa: Incorporate the amount of staked MASA a node / miner has. Future case.

Proposal

Calculate a utility score based on valid fields and response time, using respective weights. Consider a weighted sum of the above green-lit parameters. The weights assigned to each parameter (w1, w2) depend on their relative importance to the overall utility of the response... examples given are 0.2 and 0.8, respectively. The utility score could be represented as:

$$ U = w_1 \cdot V_f + w_2 \cdot \frac{1}{T_r} $$
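
A minimal sketch of that calculation, using the example weights above; the normalization of the valid-fields term and the function signature are assumptions, and the real logic would live in the per-type reward() functions:

W_VALID_FIELDS = 0.2   # w1, example weight from above
W_RESPONSE_TIME = 0.8  # w2, example weight from above

def utility_score(missing_fields: int, total_fields: int, process_time: float) -> float:
    """U = w1 * Vf + w2 * (1 / Tr), with Vf normalized to [0, 1]."""
    vf = 1.0 - (missing_fields / total_fields) if total_fields else 0.0
    tr = max(process_time, 1e-6)  # guard against division by zero
    return W_VALID_FIELDS * vf + W_RESPONSE_TIME * (1.0 / tr)

Note that the raw 1/Tr term is unbounded for very fast responses, so in practice it may need to be capped or normalized before weighting.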

Implementation Tasks

Necessary detail has been added to the relevant tickets. The goal here is to implement the calculation of utility score parameters, determine said score, and reward miners accordingly.

#122 Calculate Miner Response Time for Utility Score

#123 Calculate Missing Fields to Punish Miner Utility Score

Possible Improvements

#124 Return highest scoring response to the API