Remove abusable code from subnet template

cxmplex commented 3 months ago

The default subnet code has been reused on major mainnet subnets leading to predictable issues with lack of checks towards stakeless validators, and a triggerable exception with a longer stacktrace that a normal blacklistexception.

Forcing an error in the blacklist_fn leads to an amplication of the normal blacklist response. For instance, a normal blacklist repsonse will state the short reason, i.e. not registered. The error blacklist response will include the full error message. Some error messages may be longer than others (and the ValueError one is), leading to an amplication in the normal response. The effect amplification is around 2-3x depending on the original return text of the blacklist_fn.

Utilizing these flaws, attackers were able to do two things:

Subnet 2 (missing stake check)

On subnet 2, a validators minimum stake was not being checked. As a result, a malicious miner was able to register a low stake validator, and prompt users with challenges. The purpose of these challenges was to slow miners average response time in comparison to their normal response time to a real validator. In addition, the attacker employed several sneaky methods such as monitoring other validator requests, in order to sync their request to a validators, forcing a concurrent request on the innocent miner.

As a result, I believe it is best practice to leave in a default stake blacklist as an example, and allow subnet devs to REMOVE this if they wish, not the other way around. In best practice, the safest solution should be offered as the example, not a barebones solution. If the goal was for subnet devs to inspect the blacklist function and tailor it to their use-case, this has been demonstrated to not happen on at least 6 subnets (likely more). In most subnets, a stakeless validator has no actual purpose of calling a miners synapse. In practice, this lack of a default example has been shown to cause issues in various subnets that are often not picked up on right away.

Subnet 31 (triggerable unhandled exception)

Other affected subnets with a similar issue to 31 included subnet 33, 28, 25, 16, 11, 5. Here's my leading theory on why this leads to exhaustion, rather than the normal blacklist behavior, which as @mjurbanski-reef has shown, already throws an exception normally. Let's compare both scenarios, specifically taking a look at the traceback that will be printed here:

https://github.com/opentensor/bittensor/blob/master/bittensor/axon.py#L959

Unhandled exception in blacklist_fn:

blacklist_fn -> blacklist -> dispatch

Normal handled exception stacktrace:

blacklist -> dispatch

As you can see, forcing an error in the blacklist_fn in the subnet (note not the blacklist fn in axon.py) leads to a larger stacktrace. This stacktrace would cost additional resources to gather and print, and would even possibly include stack trace information from the asyncio upstream, metagraph calls, etc. The stacktrace length should be roughly 3-5x the size due to the increased callstack.

While a normal exception thrown by blacklist with normal behavior (i.e. the person is blacklisted) leads to a much shorter stacktrace, only blacklist, then caught by dispatch.

It's my belief this is part one to why this leads to exhaustion. The second part, is the increased response size when throwing an error from the blacklist_fn. A combination of both means greater CPU/mem pressure, and greater networking pressure. When hundreds to thousands of requests are coming in, small inefficiencies like these can play a big part to exhausting your miner. From testing, removing the unhandled exception from blacklist_fn led to continued operation for the miner while being attacked, while not handling the exception and letting it pass downstream caused server reboots on various miners machines, and overall total DoS to the axon. Why this would happen? I can only guess differences in stacktrace/callstack length.

mjurbanski-reef commented 3 months ago

I must admit I don't get this

a) overhead of uncaught exception - isn't that exactly how blacklist operates? by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318 b) amplification of response - wasn't the response just string value of exception text? why was it bigger? Do you have example?

Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978 So it should make amplification even less possible

cxmplex commented 3 months ago

Logs show amplification in response size with ValueError response vs non registered rejection and increased delay in processing. After moving that line below, I experienced a 90% reduction in latency to only a few ms each request. The amount of requests was 1000s per second, so I think any additional overhead could cause issues.

On Wed, May 22, 2024 at 5:08 PM Maciej Urbański @.***> wrote:

I must admit I don't get this

a) overhead of uncaught exception - isn't that exactly how blacklist operates? by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318 b) amplification of response - wasn't the response just string value of exception text? why was it bigger? Do you have example?

Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978 So it should make amplification even less possible

— Reply to this email directly, view it on GitHub https://github.com/opentensor/bittensor-subnet-template/pull/88#issuecomment-2125749542, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKRBZ2C6TY4NDES2JZGL3B3ZDUCONAVCNFSM6AAAAABIDDGRXSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRVG42DSNJUGI . You are receiving this because you authored the thread.Message ID: @.***>

cxmplex commented 3 months ago

Reopening soon with more comprehensive pull.

cxmplex commented 3 months ago

examples of the attacks:

2024-05-22 06:59:25.684 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49350 | 500 | None is not in list 2024-05-22 06:59:25.685 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49360 | 200 | Success INFO: 104.54.246.177:49360 - "POST /Dummy HTTP/1.1" 500 Internal Server Error 2024-05-22 06:59:25.693 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49360 | 500 | None is not in list 2024-05-22 06:59:25.694 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49364 | 200 | Success 2024-05-22 06:59:25.701 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49364 | 500 | None is not in list INFO: 104.54.246.177:49364 - "POST /Dummy HTTP/1.1" 500 Internal Server Error 2024-05-22 06:59:25.702 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49356 | 200 | Success 2024-05-22 06:59:25.710 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49356 | 500 | None is not in list INFO: 104.54.246.177:49356 - "POST /Dummy HTTP/1.1" 500 Internal Server Error 2024-05-22 06:59:25.711 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49366 | 200 | Success INFO: 104.54.246.177:49366 - "POST /Dummy HTTP/1.1" 500 Internal Server Error 2024-05-22 06:59:25.718 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49366 | 500 | None is not in list 2024-05-22 06:59:25.719 | TRACE | axon | <-- | 0 B | Dummy | None | 104.54.246.177:49378 | 200 | Success INFO: 104.54.246.177:49378 - "POST /Dummy HTTP/1.1" 500 Internal Server Error

example of missing minimum stake attack:

https://discord.com/channels/799672011265015819/1220504695404236800/1239769414585286738

cxmplex commented 3 months ago

@mjurbanski-reef

Addressed, and added a theory on why this might be. Also added in minimum stake example as well, with explanation on why I think this is needed (based on real world exploitation in sn2) and should be offered by default rather than simply mentioning that you may want to do this. This isn't a MAY, it's a MUST, in my opinion, hence the change in the documentation as well, unless otherwise demonstrated in their subnet design pattern.

Anyways, you shouldn't have unhandled exceptions (especailly of generic types like ValueErorr) that can be triggered by users even if they are effectively caught upstream.

cxmplex commented 3 months ago

I don't think 1/3 bigger is accurate FYI. @mjurbanski-reef

Perhaps I'm wrong, but you can't just count functions and say okay it's 1 extra. Its dependent on the total depth of nested calls and the complexity of the execution path of the function that throws the exception. Not like this matters too much, they straight up just shouldn't print a greedy stacktrace like that as you've mentioned, especially if its triggerable by an outside user.

opentensor / bittensor-subnet-template

Remove abusable code from subnet template #88