Per-topic MsgIdFunction

Wondertan commented 2 years ago

Pubsub only allows setting global MsgIdFunction, which applies to all the topics. Still, It would be useful to have the ability to set custom per topic msg id generation, which, e.g., can be dependant on the msg format of some topic. For this to be implemented using the global MsgIdFunction, it would need to be aware of all the msg types of each topic and do some ugly switching to determine what MsgIdFunction to use for what msg.

Fortunately, an unused TopicOpt can set custom values while joining the topic, so some per topic customizations are conceived by design, and the described feature can be implemented.

vyzo commented 2 years ago

Is this actually necessary? The msgID function has access to the message and can look at the topic. So the desired functionality is already there.

vyzo commented 2 years ago

Note that there is no inherent need for ugly switching, one can use a map with a readonable default.

Wondertan commented 2 years ago

Imagine there are two completely different protocols relying on PubSub. One of them needs custom msg id generation and another one is not. The application which relies on these two protocols and creates a PubSub instance would need to be aware of the protocol internals and import their msg types to do the switch. So the ugliness comes not from the switch itself but from the ugly architecture requiring the application to know the internals. Sorry for the confusion.

Wondertan commented 2 years ago

The point is to allow msg id generation to be application/protocol/layer specific.

In my case, I need to exclude one non-deterministic field of a structure from id generation with blake2b, but I don't want to have one global MsgIdFunction importing all msg types from all the protocols to switch over them.

vyzo commented 2 years ago

Well ok, but it complicates things.

Wondertan commented 2 years ago

How?

vyzo commented 2 years ago

The right way to do it internally is to have one msgid function, which dsipatches on a map of funcs set potentially per topic and a default that can be overwritten as is now with the option.

vyzo commented 2 years ago

this way the internal changes will be minimal, and we dont break existing applications.

Wondertan commented 2 years ago

The solution I proposed keeps the logic and API the same and still allows customization. If WithCustomMessageIDFunc is not given for a topic, then the topic uses the global default or opted one, so nothing should break.

In other words,WithCustomMessageIDFunc optionally overrides global MsgIdFunction only for a specific topic.

vyzo commented 2 years ago

That's not very elegant i think, it makes users reinvent the wheel and messy internals. We can do better.

Lets create a struct type that has a map and a default, with a RWMutex. It also has a method for computing the msgid, which caches and we pass it where we otherwise use the msgid func. Lets also add the topic option, which mutates the map with exclusive lock.

This way nothing changes in the external interface and on the same time we get a maximally flexible solution. The extant msgid option would set the default.

vyzo commented 2 years ago

Btw, for future reference, it is preferrable to discuss design in an issue and reach consensus before opening the pr, so that you dont do unnecessary work.

Wondertan commented 2 years ago

I am in 👍🏻

We can do better.

Thank you for the detailed description. Qs:

The map like map[string]map[*pb.Message]string? // topic -> msg -> id
- So we don't need to add ID field on the Message(wrapper). Is this bad and/or introduces messy internals? IMHO, adding ID to the Message(wrapper) is clean, but I might be missing something.
Do we want to lock the processLoop? Although in practice map mutates would happen only on application start as rarely topics are joined in the runtime, so this is not a concern.
WDYT about msgIDGen as the name for the struct?
Should I still split the implementation into two PRs? For once only compute and for the option.
This requires each topic to have a ptr to the struct so the option can access it. Are you ok with that?

Lets also add the topic option, which mutates the map with exclusive lock.

Agree and I wanted to write a proposed solution but it turned out to be easier to implement it instead of describing it in English for this case :)

Btw, for future reference, it is preferrable to discuss design in an issue and reach consensus before opening the pr, so that you dont do unnecessary work.

vyzo commented 2 years ago

The map will be a map[string]func(*pb.Message)string. Basically a map of topic names to functions that compute their msgid. The entry point in the object will be a func (m *Message) string method, which does the following:
- Checks the cache on m; if it is already computed, it returns it directly.
- RLocks the map, and retrieves the compute func
- Invokes the compute func with the *pb.Message, caches and returns the result
  1. There is absolutely no need to lock the process loop; the lock is rw and scoped in the object that manages the msgid func map.
  2. Not bad, but we can use more vowels. Maybe MsgIDGenerator?
  3. No, just do one pr. We don't ahve to change the code that touches the msgid at all actually, we just pass the method in our MsgIdGen object to SetMsgID and we are set.
  4. No, it doesn't; it has a pointer to the pubsub object and this is sufficient. We can invoke methods in the msgIDGen instance in pubsub directly.

Agree and I wanted to write a proposed solution but it turned out to be easier to implement it instead of describing it in English for this case :)

Yeah, sometimes it is easier and that's totally fine, I just don't want you to do work that might be thrown away at design stage. Then again you learn the internals better by doing that, so it's not a total loss.

vyzo commented 2 years ago

Note that the cache is new field we add on the Message wrapper struct. It doesn't have to be thread safe, as the first thing we do with the message before any concurrency is involved is compute its message id.

vyzo commented 2 years ago

Also note that we should relinquish the rlock before invoking the compute function, so that we don't block other stuff while it is computing.

Wondertan commented 2 years ago

We still need to change MessageCache a bit as func (m *Message) string != func (m *pb.Message) string. This includes both Add(to rcv a wrapper) and SetMsgIdFn(to rcv a generator with wrapper as param) methods.

We don't ahve to change the code that touches the msgid at all actually, we just pass the method in our MsgIdGen object to SetMsgID and we are set.

This is also why I originally thought of the map with *pb.Message, as we don't have access to the wrapper in most places.

vyzo commented 2 years ago

Ah yes indeed, but thats a very small change.

BigLep commented 2 years ago

2022-01-07: there is a linked PR (https://github.com/libp2p/go-libp2p-pubsub/pull/465 ) that we're trying to get across the line.

libp2p / go-libp2p-pubsub

Per-topic MsgIdFunction #464