bug: autosharding resolves content topics to wrong shard

fbarbu15 commented 8 months ago

I think there are 2 issues here" 1st: content topics resolves to a very big shard number if pubsub-topic is not present in the docker flags

EX: docker run -i -t -p 34696:34696 -p 34697:34697 -p 34698:34698 -p 34699:34699 -p 34700:34700 harbor.status.im/wakuorg/nwaku:latest --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=DEBUG --rest-relay-cache-capacity=100 --websocket-port=34698 --rest-port=34696 --tcp-port=34697 --discv5-udp-port=34699 --rest-address=0.0.0.0 --nat=extip:172.18.139.12 --peer-exchange=true --discv5-discovery=true --cluster-id=2 --content-topic=/toychat/2/huilong/proto --relay=true --filter=true

Will resolve to /waku/2/rs/2/58355

While it should resolve to /waku/2/rs/2/3

2nd: content topics resolves any content topics to 0 if pubsub-topic is present in the docker flags

EX: docker run -i -t -p 34696:34696 -p 34697:34697 -p 34698:34698 -p 34699:34699 -p 34700:34700 harbor.status.im/wakuorg/nwaku:latest --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=DEBUG --rest-relay-cache-capacity=100 --websocket-port=34698 --rest-port=34696 --tcp-port=34697 --discv5-udp-port=34699 --rest-address=0.0.0.0 --nat=extip:172.18.139.12 --peer-exchange=true --discv5-discovery=true --cluster-id=2 --pubsub-topic=/waku/2/rs/2/2 --content-topic=/toychat/2/huilong/proto --content-topic=/statusim/1/community/cbor --content-topic=/waku/2/content/test.js --relay=true --filter=true

I can see in the logs

DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:296 pubsubTopic=/waku/2/rs/2/2
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:242 pubsubTopic=/waku/2/rs/2/2
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:296 pubsubTopic=/waku/2/rs/2/0
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:242 pubsubTopic=/waku/2/rs/2/0
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:296 pubsubTopic=/waku/2/rs/2/0
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:242 pubsubTopic=/waku/2/rs/2/0
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:296 pubsubTopic=/waku/2/rs/2/0
DBG 2024-03-19 15:08:00.560+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:242 pubsubTopic=/waku/2/rs/2/0

While /toychat/2/huilong/proto should resolve to shard 3, /statusim/1/community/cbor to 4 and /waku/2/content/test.js to 1

alrevuelta commented 8 months ago

1st: content topics resolves to a very big shard number if pubsub-topic is not present in the docker flags

You are correct that resolving to /waku/2/rs/2/58355 is wrong, but note that /waku/2/rs/2/3 is not the expected shard. What makes you think that?

The configuration you provide lacks pubsub-topic, so how does the node know the available shards? If the node doesn't know the available topics, then it doesn't know the amount of shards, so its not possible to resolve it properly. How many shards there are? 1? 10?.

I suggest erroring if pubsub-topic is not configured. Note that in known clusters (eg 1) these are loaded automatically.

2nd: content topics resolves any content topics to 0 if pubsub-topic is present in the docker flags

Indeed a problem. Wondering though if it makes sense to configure just 1 shard that is --pubsub-topic=/waku/2/rs/2/2 instead of --pubsub-topic=/waku/2/rs/2/0. This can end up being a bit confusing, since if we allow that, then [/waku/2/rs/2/12, /waku/2/rs/2/2334, /waku/2/rs/2/99999] is technically allowed and can be messy. So I would suggest enforcing that the topic starts with 0 up to num_shards-1.

In summary:

For 1. can be fixed erroring if no pubsub-topic are provided (unless its a known cluster)
For 2. enforce that shards starts at /waku/2/rs/cluster-id/0 ... /waku/2/rs/cluster-id/num_shards-1

thoughts @jm-clius @SionoiS

fbarbu15 commented 8 months ago

/waku/2/rs/2/3

go-waku and js-waku are resolving it to shard 3 I've raised this issue in the context of interop testing between nwaku and go-waku/js-waku So if we change the resolving logic in one place we need to change in all 3 places

fbarbu15 commented 8 months ago

I suggest erroring if pubsub-topic is not configured

I might be wrong but also in the context of static and auto sharding I was thinking that:

if we start the node with pubsub-topic then we are using/testing static-sharding
if we start the node without pubsub-topic but with content-topic then we are using/testing auto-sharding

jm-clius commented 8 months ago

Urgh. This is messy in our config and I can understand the confusion. My feeling:

if we have pubsub-topic configured, there should be no content-topic configuration (i.e. static subscription, no autosharding)
for autosharding, content-topic can be configured but there MUST be shards configuration starting at 0 up to the configured shards num.

fbarbu15 commented 8 months ago

Based on the discussion, we can define the following configurations:

--pubsub-topic=/waku/2/rs/2/0 - Valid
--pubsub-topic=/waku/2/rs/2/1 - Invalid (starts with shard 1)
--pubsub-topic=/waku/2/rs/2/0 --pubsub-topic=/waku/2/rs/2/1 - Valid
--pubsub-topic=/waku/2/rs/2/0 --pubsub-topic=/waku/2/rs/2/2 - Invalid (missing shard 1)
--pubsub-topic=/waku/2/rs/2/0 --content-topic=/myapp/1/latest/proto - Invalid (mix of pubsub-topic and content-topic)
--content-topic=/myapp/1/latest/proto - Valid (resolves to shard 0)
--content-topic=/waku/2/content/test.js - Invalid (resolves to shard 1)
--content-topic=/myapp/1/latest/proto --content-topic=/waku/2/content/test.js - Valid
--content-topic=/myapp/1/latest/proto --content-topic=/app/22/sometopic/someencoding - Invalid (missing shard 1)

These topic resolutions are based on what I saw implemented in go-waku and js-waku, probably aligned with the RFC 51:

/myapp/1/latest/proto resolves to shard 0
/waku/2/content/test.js resolves to shard 1
/app/22/sometopic/someencoding resolves to shard 2

Please lmk if this is how it should work

SionoiS commented 8 months ago

The problem we face is because the network specs. for cluster id != 1 are undefined.

Couple points;

Not allowing a mix of static and autosharded topic seams arbitrary.
--pubsub-topic=<static shard> should always be valid, as per the RFC.
configured topics are subscriptions they cannot define the network.

Maybe adding a --total-shards=X as config so that autosharding can be configured? If not present and content topics are specified then error (except for TWN)?

The reasoning behind autosharding for cluster != 1 was to make it easier to debug and test. If this is not the case then what are we doing?

fbarbu15 commented 8 months ago

It does make the testing easier. I'm just trying to figure out which are the valid scenarios and what is invalid so I can write the interop tests accordingly

SionoiS commented 8 months ago

It does make the testing easier. I'm just trying to figure out which are the valid scenarios and what is invalid so I can write the interop tests accordingly

I feel like we should not test cluster != 1 since it's undefined behavior (in the context of autosharding) that would simplify thing greatly no? Since cluster != 1 is for testing purpose only, should it really be tested?

Also when you say valid or invalid in my mind it's always according to the spec. (spec say cluster != 1 undefined)

We need a simpler way to config Nwaku. Seams like this should be simple but it's not and makes testing difficult.

fbarbu15 commented 8 months ago

Can you share that spec please, maybe I'm missing something

We started testing autosharding on cluster != 1 mainly because on cluster 1 RLN is enabled by default and the testing frameworks are not configured for relay yet and even if they were, I feel the rate limiting will be a problem for tests where we send lots of messages in parallel

Also as discussed with @alrevuelta in this thread, autosharding on other clusters should be possible not only for testing purposes,

SionoiS commented 8 months ago

Can you share that spec please, maybe I'm missing something

TWN -> https://rfc.vac.dev/spec/64/#network-shards and autosharding -> https://rfc.vac.dev/spec/51/#automatic-sharding

We can make changes if some part is unclear.

We started testing autosharding on cluster != 1 mainly because on cluster 1 RLN is enabled by default and the testing frameworks are not configured for relay yet and even if they were, I feel the rate limiting will be a problem for tests where we send lots of messages in parallel

I see

Also as discussed with @alrevuelta in this thread, autosharding on other clusters should be possible not only for testing purposes,

Yes it is possible. Maybe we could define autosharding on cluster != 1 as having shards 0-1023 just like static shards? https://rfc.vac.dev/spec/51/#static-sharding

jm-clius commented 8 months ago

Revisiting this issue. I know this contradicts a bit what I said earlier, but @SionoiS makes some good points:

autosharding is a convenience API built on top of a network definition (i.e. a combination of cluster ID and shard numbers). It is unrelated to the shard subscriptions of the node so shouldn't be affected by any such shard subscription configuration. It does relate to the cluster ID (in that the cluster ID is shorthand for which network you're connected to)
the only network definition we had so far is cluster ID = 1 (TWN) which autoshards content topics to the 8 network shards. It makes sense that autosharding was only available on this cluster up to now. Any other cluster ID is assumed to use static/explicit mapping to shards.
for controlled testing purposes, it probably makes sense to define at least one other test network that is not RLN rate limited (therefore open and experimental only), but that mirrors the shards we have allocated for TWN on cluster ID=1
I wouldn't define autosharding on any other cluster IDs, as none of these are tied to network definitions (even 1023 shards per cluster ID was just a suggestion and shouldn't be taken as spec directive)

In short, my suggestion:

define new network for a test network without RLN that is symmetrical to TWN. I suggest cluster ID = 9, shards=0-7 (on the assumption that we want to keep cluster IDs 2-7 for RLN-controlled testnets and cluster ID 8 to eventually mirror whatever we deploy to cluster ID 0 in the future.)
autosharding on this test network 9 mirrors autosharding on TWN 1 exactly (and we roll out new shards in new generations in parallel to both 1 and 9).
do not further add to or complicate configuration or tie autosharding (high-level network app function) to subscription config
autosharding undefined on any cluster IDs != 1 or 9

WDYT of this suggestion? cc @fbarbu15 @SionoiS @alrevuelta @richard-ramos

fbarbu15 commented 8 months ago

Thanks @jm-clius, that makes sense for me and it would simplify testing for autosharding.

alrevuelta commented 8 months ago

define new network for a test network without RLN that is symmetrical to TWN. I suggest cluster ID = 9, shards=0-7 (on the assumption that we want to keep cluster IDs 2-7 for RLN-controlled testnets and cluster ID 8 to eventually mirror whatever we deploy to cluster ID 0 in the future.)

Unsure what you mean by this. A network is defined by its nodes. You can already do this by running nodes like this and connecting them together.

./build/wakunode2\
  --relay=true\
  --pubsub-topic=/waku/2/rs/9/0\
  --pubsub-topic=/waku/2/rs/9/1\
  --pubsub-topic=/waku/2/rs/9/2\
  --pubsub-topic=/waku/2/rs/9/3\
  --pubsub-topic=/waku/2/rs/9/4\
  --pubsub-topic=/waku/2/rs/9/5\
  --pubsub-topic=/waku/2/rs/9/6\
  --pubsub-topic=/waku/2/rs/9/7\
  --cluster-id=9

autosharding on this test network 9 mirrors autosharding on TWN 1 exactly (and we roll out new shards in new generations in parallel to both 1 and 9).

Sure. Guess we would need a network flag for this (auto vs static). But tbh, if status is not using static sharding, why having auto vs static? Just one type of sharding and that's it (=auto).

SionoiS commented 8 months ago

A network is defined by its nodes.

Only, if we assume that a node subscribe to all shards. In your example the network used has cluster id 9 BUT we can only know that it has at least 8 shards and without the exact number of shards, autosharding cannot be used.

In short, my suggestion: .....

Yes, by not allowing autosharding for every cluster we don't have to deal with the edge cases just yet. 👍

jm-clius commented 8 months ago

A network is defined by its nodes

Within the context of a configuration, RLN membership set and autosharding, the "network" definition is a combination of cluster ID and shards (with implied generation, membership set, rate limit and contract address). We have a single RLN membership for the network defined by cluster ID 1. If I have a node participating in the Waku Network and this node is encapsulated in an application that decides to use autosharding, any content topic the node publishes to should map according to a hash function that is "aware" of the 8 shards that define Gen 0 of the network. This should be true, whether the node is subscribed to all 8 pubsub topics or not. In fact, the subscriptions of the node should not affect the hashing of content topics to pubsub topics at all (by specification).

why having auto vs static?

I don't think strictly-speaking Waku has autosharding - it's just a convenience API on top of a Waku network. We only have "static" sharding on the Waku protocol level. Autosharding provides a convenient application level API that automatically populates the shard/pubsub topic based on content topic. For that it needs to know what the hash space looks like, which depends on a cluster + num_of_shards (and generation) definition. We've only defined TWN 8 shards so far. My proposal is to define one more such network for testing purposes. I don't necessarily think we'd need to add configuration - autosharding is an implied use case if the application uses an API without specifying a pubsub topic.

alrevuelta commented 8 months ago

A network is defined by its nodes.

Only, if we assume that a node subscribe to all shards. In your example the network used has cluster id 9 BUT we can only know that it has at least 8 shards and without the exact number of shards, autosharding cannot be used.

imho that's not how it works. pubsubTopics define all available ones:

This subscribes to all because thats the default

./build/wakunode2\
  --relay=true\
  --pubsub-topic=/waku/2/rs/9/0\
  --pubsub-topic=/waku/2/rs/9/1\
  --pubsub-topic=/waku/2/rs/9/2\
  --pubsub-topic=/waku/2/rs/9/3\
  --pubsub-topic=/waku/2/rs/9/4\
  --pubsub-topic=/waku/2/rs/9/5\
  --pubsub-topic=/waku/2/rs/9/6\
  --pubsub-topic=/waku/2/rs/9/7\
  --cluster-id=9

But this, only subscribes to 0, 1.

./build/wakunode2\
  --relay=true\
  --pubsub-topic=/waku/2/rs/9/0\
  --pubsub-topic=/waku/2/rs/9/1\
  --pubsub-topic=/waku/2/rs/9/2\
  --pubsub-topic=/waku/2/rs/9/3\
  --pubsub-topic=/waku/2/rs/9/4\
  --pubsub-topic=/waku/2/rs/9/5\
  --pubsub-topic=/waku/2/rs/9/6\
  --pubsub-topic=/waku/2/rs/9/7\
  --cluster-id=9
  --shard=0 --shard=1

SionoiS commented 8 months ago

imho that's not how it works. pubsubTopics define all available ones:

Well I'm even more confused now. I though --pubsub-topic= were to be deprecated. We can't if they are used this way. :thinking:

alrevuelta commented 8 months ago

Well I'm even more confused now. I though --pubsub-topic= were to be deprecated. We can't if they are used this way. 🤔

This is not the final solution, but a node should know the available shards of the network. Either directly (topics) or implicitly (like networkAmountShards).

I would deprecate it, but I'm afraid that will break some things that by now I'm unaware how to deal with.

jm-clius commented 8 months ago

a node should know the available shards of the network

Right. I notice that this is used to initialize the sharding. The only place this number of shards is used seems to be for autosharding. In retrospect, this seems to me to be the wrong approach, especially because we also subscribe to all --pubsub-topic topics. The number of shards per network gen is a specification and not a configuration. It should be decoupled from subscriptions. I would have this number hard-coded for now - anyone using the autosharding API should expect their content topics to be hashed deterministically to shards 0-7 for this generation of the network. Furthermore, we should have only one way to configure topic subscription: --cluster-id: and a repeated --shard:

Everything else should be deprecated, IMO.

waku-org / nwaku

bug: autosharding resolves content topics to wrong shard #2538