opentensor / validators

Repository for bittensor validators
https://www.bittensor.com/
MIT License

Flexibilize wandb config #136

Closed p-ferreira closed 1 year ago

p-ferreira commented 1 year ago

Problem

As shown in #134 , some members of the community rely on the wandb config for data extraction.

The change implemented in #132 aimed to reduce the surface for exploits of the data logged in the wandb config.

The rationale was that information valuable for filtering, such as hotkey, netuid, version, and createdAt, would already be available through the tags and wandb metadata.

Using the native MongoDB-style API exposed by wandb, ideally one would be able to filter runs with the following code:

import wandb
from datetime import datetime, timedelta

# Date that you want to filter, in this case 3 days ago
date_filter = datetime.now() - timedelta(days=3)

api = wandb.Api()
all_runs = api.runs("opentensor-dev/openvalidators", filters={
    "$and": [
        {"createdAt": {"$gt": date_filter.timestamp()}},
        {"tags": {"$all": ["1.1.8", "netuid_1"]}}
    ]})
print(len(all_runs))

Unfortunately, wandb api throws the following internal error for the call:

wandb: Network error (HTTPError), entering retry loop.

HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql

The foundation is currently in contact with the wandb team to establish a communication channel aimed at improving the integration with their platform, which is currently relatively unstable on the API side.

One can work around this issue by querying all the runs and filtering them manually, preferably with a retry mechanism in place, as the wandb API occasionally throws exceptions.

Below is an example of how to filter runs by tags, username and date:

import wandb
from datetime import datetime, timedelta
import logging
from tenacity import retry, stop_after_attempt, wait_fixed, before_sleep_log

api = wandb.Api()

# Tags that you want to filter
filter_tags = ['1.1.8', 'netuid_1']

# User that you want to filter
username_filter = 'opentensor-pedro'

# Date that you want to filter, in this case 3 days ago
date_filter = datetime.now() - timedelta(days=3)

@retry(stop=stop_after_attempt(10), wait=wait_fixed(0.5), before_sleep=before_sleep_log(logging.getLogger(), logging.WARNING))
def get_filtered_runs():
    all_runs = api.runs('opentensor-dev/openvalidators', filters={'tags': {"$in": filter_tags}})
    print('Total collected runs:', len(all_runs))

    filtered_runs = []        
    for run in all_runs:
        # Check if run has all filter tags
        run_matches_filter_tags = all(filter_tag in run.tags for filter_tag in filter_tags)
        run_matches_username = run.user.username == username_filter     
        run_matches_date = datetime.strptime(run.created_at, '%Y-%m-%dT%H:%M:%S') > date_filter

        if run_matches_filter_tags and run_matches_date and run_matches_username:
            filtered_runs.append(run)        

    return filtered_runs

filtered_runs = get_filtered_runs()
print('Filtered runs:', len(filtered_runs))

The solution above is far from ideal: it is very slow, and the data retrieved from wandb is not reliable (the total number of runs does not match what is seen in the UI filter). This issue was not observed when filtering by config.netuid.
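Since filtering by config.netuid did not exhibit the mismatch, here is a minimal sketch of that server-side filter. The helper name is mine, and the actual API call is left commented out since it requires network access and a logged-in wandb key:

```python
# Hedged sketch (not verified against the live API): a server-side filter
# on config.netuid, which per this thread did not show the count mismatch
# that tag-based filtering did.
def netuid_filter(netuid):
    """Build a MongoDB-style filter dict for wandb's api.runs()."""
    return {"config.netuid": netuid}

# Usage (requires `import wandb`, a logged-in API key, and network access):
# api = wandb.Api()
# runs = api.runs("opentensor-dev/openvalidators", filters=netuid_filter(1))

print(netuid_filter(1))  # {'config.netuid': 1}
```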

Proposed solution

With all that in mind, it could be interesting for everybody to bring back parts of the original config, such as

surcyf123 commented 1 year ago

Hey @p-ferreira,

Thanks for the proposed solution to allow the community to continue to pull and analyze the wandb data. I have a couple thoughts on your solution.

Firstly, I noticed your solution includes relying on the tags for netuid. The problem with that is that netuid_X is not always present within the tags. There are many runs that just look something like this.

Tags: 1.1.2 | 1012 | 5GnZ46pqSnEAK518wA7DiwNN4AkDhDtkbMdjseKDawMwCxMg | reciprocate_reward_model | rlhf_reward_model
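To illustrate the gap, here is a small sketch (the helper name is hypothetical) that tries to recover a netuid from a run's tag list and comes up empty for these older runs:

```python
# Hypothetical helper: extract the netuid from a 'netuid_X' tag, returning
# None when the tag is absent (as in runs from older validator versions).
def find_netuid_tag(tags):
    for tag in tags:
        if tag.startswith("netuid_"):
            try:
                return int(tag.split("_", 1)[1])
            except ValueError:
                continue
    return None

# An older run's tags, as in the example above (hotkey truncated):
old_tags = ["1.1.2", "1012", "5GnZ46pq...", "reciprocate_reward_model",
            "rlhf_reward_model"]
# A newer run's tags, as in the filter example earlier in the thread:
new_tags = ["1.1.8", "netuid_1"]

print(find_netuid_tag(old_tags))  # None
print(find_netuid_tag(new_tags))  # 1
```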

But it seems that after the most recent wandb logging updates, in validator version >= 1.1.3, the netuid flag is present. So this should work for netuid as long as the validator version is >= 1.1.3. But even on August 18th, there were some validators running 1.1.0 and 1.1.1, so it would be cleaner to have it specified within the config details. So to solve the first issue you mentioned, we can use a solution such as the PR I pushed to keep netuid within the config, and then filter by just adding this logic while using the API:

import wandb
from datetime import datetime, timedelta

api = wandb.Api()
project_path = 'opentensor-dev/openvalidators'
start_timestamp = (datetime.now() - timedelta(days=3)).timestamp()
run_states = {}

for run in api.runs(project_path):
    netuid = run.config.get('netuid')
    created_timestamp = datetime.fromisoformat(run.created_at).timestamp()

    # Tally run states (e.g. 'finished', 'running')
    run_states[run.state] = run_states.get(run.state, 0) + 1

    if start_timestamp < created_timestamp and netuid == 1:
        history_data = run.history()
        new_data = history_data.to_dict(orient="records")

This method is very stable and I used it to pull every wandb run ever for all validators with no issues.

Secondly, the tags do contain the reward models used, but they don't contain their weightings. This can be solved by keeping config details such as reward_rlhf_weight, among the other reward model weights.

For these reasons, I still think the best solution is to keep the config in place as it was, but remove the potential network vulnerabilities by dropping the config keys mentioned in my PR.

p-ferreira commented 1 year ago

Hey @surcyf123!

I see. From what I understood, as you mentioned in #134 , you saw people running miners alongside validators within wandb, which would explain the presence of the openai key in the validator config (since both miner and validator are running together). You also mentioned:

I can share the dataset of all previous config details if it would be of interest to you while considering this PR.

Yes please, it would be valuable for us to have insights on how people in the community use the config so we can better adapt our current structure to the community needs.


Now, addressing the other points that you added in your comment:

Firstly, I noticed your solution includes relying on the tags for netuid. The problem with that is that netuid_X is not always present within the tags. There are many runs that just look something like this...

Exactly, as you already noted, the netuid tag came precisely in version 1.1.3, #95.

About validators running older versions, we are aware of this reality. We always advise users to update their validators so they are up to date with the latest changes, and we have invested effort in automating the whole process to make it easy for them (like the autorun script). We could try to think of new ways to incentivize users to keep their validators up to date.

But I totally agree with you; I do see the value of bringing back the netuid in the config.


Secondly, the tags do contain the reward models used, but they don't contain their weightings. This can be solved by keeping config details such as reward_rlhf_weight, among the other reward model weights.

Yes, that’s why we decided to keep the reward models config in #132 . Currently, all the neuron and reward information (including weights) is logged there as it was before.
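For illustration, here is a small sketch (the helper and sample values are mine) of pulling reward-model weights back out of a run's config, assuming keys follow the reward_*_weight pattern named above:

```python
# Hypothetical sketch: run.config behaves like a dict, so reward-model
# weights can be collected by key pattern. Key names other than
# 'reward_rlhf_weight' (and all values) are illustrative.
def get_reward_weights(config):
    """Collect reward-model weight entries from a run config mapping."""
    return {k: v for k, v in config.items()
            if k.startswith("reward_") and k.endswith("_weight")}

sample_config = {"reward_rlhf_weight": 0.4, "reward_dpo_weight": 0.6,
                 "netuid": 1}
print(get_reward_weights(sample_config))
# {'reward_rlhf_weight': 0.4, 'reward_dpo_weight': 0.6}
```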

You mentioned in your PR #134 that users have a different set of keys in their config (including the open_ai key). Could you provide us with this information so we can get a better grasp of how people are actually using it?

The rationale behind #132 was to have the best config for the majority of users, from our perspective. I do see how beneficial it would be to bring back some information like netuid and the wandb config, but if there is a strong use case for including other data that might be important for the community, we can definitely discuss how to move forward.

The keys in your PR don’t accurately represent the default keys of the openvalidators config, so we are walking the line between adding things that will be useful for everyone and adding things for a particular set of users who have custom configs.

So, with all of that in mind, any information on the custom configs and use cases of the community that helps us proceed with the solution discussion would be greatly appreciated.