opentensor / bittensor

Internet-scale Neural Networks
https://www.bittensor.com/
MIT License

Alpha is too high #1358

Closed mrseeker closed 1 year ago

mrseeker commented 1 year ago

This might need to change as we increase the number of UIDs. Let's put in a ticket and do a deeper analysis once the network is stable.

_Originally posted by @Eugene-hu in https://github.com/opentensor/bittensor/pull/1304#discussion_r1179385369_

This issue is causing validators that deleted their old model to lose trust (I am currently stuck at 0.81%), while the old validators are stuck in a "status quo" that is not healthy for the network.

Can someone fix this? The issue is already a month old.

adriansmares commented 1 year ago

Currently, subnet 1 has 1024 UIDs, of which 128 are validator permit nodes. There is also an immunity period of 24 hours.

By default, a validator queries a random sample of 50 UIDs, which takes ~10 seconds on average. Running the reward model, at least on my modest hardware, takes another ~30 seconds, so in total it takes ~40 seconds to process 50 UIDs.

With 1024 - 128 = 896 UIDs to hit, it takes about 896 / 50 * 40 = ~720 seconds, or 12 minutes, to hit the whole network. This is not entirely correct, because the validator code in subnet 1 does not work through a shuffled reservoir and only draws random samples, but for the purpose of napkin math it should still work.

A validator therefore hits each UID about 24 * 60 / 12 = 120 times a day, which means that on average the EMA update runs 120 times per UID per day.
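The napkin math above can be written down as a small sketch so the inputs are easy to vary. The function and parameter names here are hypothetical and do not come from the validator code; the numbers are the ones quoted in this comment.

```python
# Napkin-math sketch; all names are illustrative, not taken from the validator code.

def sweep_seconds(total_uids, validator_uids, sample_size, seconds_per_sample):
    """Approximate time for one validator to query every non-validator UID once.

    Assumes perfect coverage (no UID sampled twice before all are hit),
    which the comment above notes is not exactly true for random sampling.
    """
    uids_to_hit = total_uids - validator_uids
    return uids_to_hit / sample_size * seconds_per_sample


def updates_per_day(sweep_s):
    """How many times a single UID is hit (and its EMA updated) in 24 hours."""
    return 24 * 60 * 60 / sweep_s


# Subnet 1 figures from this thread: 1024 UIDs, 128 validators,
# samples of 50 UIDs, ~40 seconds per sample (query + reward model).
sweep = sweep_seconds(1024, 128, 50, 40)
print(f"sweep: {sweep:.0f} s ({sweep / 60:.1f} min)")
print(f"EMA updates per UID per day: {updates_per_day(sweep):.0f}")
```

With these inputs the sweep lands at roughly 12 minutes and about 120 updates per UID per day, matching the rounded figures used in the rest of the analysis.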

For an alpha of 0.01, after 24 hours the validator will have reached only ~70% of the final score for a newly registered UID:

>>> alpha=0.01
>>> s=0
>>> for _ in range(120):
...   s = s * (1-alpha) + 100 * alpha
...
>>> s
70.06196086876682

If we increase alpha to 0.05, we get ~99.8%:

>>> alpha=0.05
>>> s=0
>>> for _ in range(120):
...   s = s * (1-alpha) + 100 * alpha
...
>>> s
99.78775736213011

Even if the network grows to 2048 UIDs, so that it takes ~25 minutes to hit the whole network and a single UID is hit only ~60 times a day, with an alpha of 0.05 we still reach ~95% of the steady-state score:

>>> alpha=0.05
>>> s=0
>>> for _ in range(60):
...   s = s * (1-alpha) + 100 * alpha
...
>>> s
95.39302010130474
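The three sessions above can be folded into one helper to compare settings side by side. This is a minimal sketch: `ema_after` is a hypothetical name, and it assumes, as the sessions do, that the score starts at 0 and that every update sees the same target of 100.

```python
def ema_after(n_updates, alpha, target=100.0, start=0.0):
    """Run the same EMA update as the sessions above n_updates times."""
    s = start
    for _ in range(n_updates):
        s = s * (1 - alpha) + target * alpha
    return s


# (alpha, updates per day) pairs discussed in this comment.
for alpha, n in [(0.01, 120), (0.05, 120), (0.05, 60)]:
    print(f"alpha={alpha:<5} updates={n:<4} score={ema_after(n, alpha):.2f}")
```

In practice the per-query rewards vary, so the real score will be noisier than this constant-target model suggests; the sketch only shows how quickly the EMA can approach its steady state.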

My proposal is to increase alpha to 0.05 and, in the long term, to consider turning it into a subnet parameter. The parameter currently lives only in the validator code, and most operators won't edit it to solve problems like this one. In subnet 3 alpha is 0.1, and UIDs reach their steady-state score reasonably fast.

cc @opentaco for thoughts on this.

If anyone has a closed-form formula for the EMA, please let me know.
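For reference, under the same simplification used in the sessions above (a constant target T on every update), unrolling s_n = (1 - alpha) * s_(n-1) + alpha * T as a geometric series gives s_n = T + (s_0 - T) * (1 - alpha)^n, which reduces to T * (1 - (1 - alpha)^n) when s_0 = 0. The sketch below checks this against the loop and inverts it to estimate how many updates are needed to reach a given fraction of the steady state. The function names are hypothetical.

```python
import math


def ema_closed_form(n_updates, alpha, target=100.0, start=0.0):
    """Closed form of the EMA after n_updates with a constant target."""
    return target + (start - target) * (1 - alpha) ** n_updates


def ema_loop(n_updates, alpha, target=100.0, start=0.0):
    """Iterative version, identical to the REPL sessions above."""
    s = start
    for _ in range(n_updates):
        s = s * (1 - alpha) + target * alpha
    return s


def updates_to_reach(fraction, alpha):
    """Smallest n such that 1 - (1 - alpha)**n >= fraction, starting from 0."""
    return math.ceil(math.log(1 - fraction) / math.log(1 - alpha))


for alpha in (0.01, 0.05):
    closed = ema_closed_form(120, alpha)
    looped = ema_loop(120, alpha)
    print(f"alpha={alpha}: closed={closed:.4f} loop={looped:.4f} "
          f"updates to 95%: {updates_to_reach(0.95, alpha)}")
```

Under these assumptions, reaching 95% of the steady state takes 299 updates at alpha 0.01 but only 59 at alpha 0.05, which at ~120 updates a day is roughly two and a half days versus half a day.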

opentaco commented 1 year ago

@adriansmares, appreciate the great analysis. We'll set alpha to 0.05 for the openvalidators release.