owncloud / ocis

ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0

events: separate main-queue into multiple streams or use multiple subjects #8949

Open wkloucek opened 5 months ago

wkloucek commented 5 months ago

Is your feature request related to a problem? Please describe.

Currently we have a single stream for events called "main-queue". We also use a single subject called "main-queue".

On this stream we have many consumer groups:

nats consumer report main-queue
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                     Consumer report for main-queue with 14 consumers                                                    │
├──────────────────────────────────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬───────────┬─────────────────────────┤
│ Consumer                                 │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor │ Cluster                 │
├──────────────────────────────────────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼───────────┼─────────────────────────┤
│ antivirus                                │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ audit                                    │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ clientlog                                │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ dcfs                                     │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ evhistory                                │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ frontend                                 │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ notifications                            │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ policies                                 │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ postprocessing                           │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ search                                   │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ sse-3bdbbf44-e309-4186-a21a-3374bc75143d │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-1*                 │
│ sse-69687f14-faf6-4f84-823b-f60574074a4f │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-2*                 │
│ storage-users                            │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
│ userlog                                  │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 392,350   │ nats-0, nats-1, nats-2* │
╰──────────────────────────────────────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴───────────┴─────────────────────────╯

Every consumer group receives all events because we have a single stream with a single subject.

Consumers that do some heavy work based on events may fall behind when a lot of events are generated. What happens to those consumers can be read here: https://docs.nats.io/running-a-nats-service/nats_admin/slow_consumers
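To make the "fall behind" case concrete, here is a rough back-of-the-envelope model, not a measurement. All rates are made-up numbers, and the function name and parameters are hypothetical; the only assumption is a consumer that needs a fixed amount of time per event and works with a given concurrency.

```python
def backlog_after(seconds: float, events_per_second: float,
                  seconds_per_event: float, concurrency: int = 1) -> int:
    """Estimate the number of unprocessed events after `seconds` of sustained load.

    All parameters are hypothetical inputs, not oCIS or NATS settings.
    """
    service_rate = concurrency / seconds_per_event  # events/s the consumer can finish
    growth = events_per_second - service_rate       # net events/s added to the backlog
    return max(0, round(growth * seconds))


# 10 events/s arrive, each takes 250 ms, one worker: only 4 events/s get done,
# so the backlog grows by 6 events every second.
print(backlog_after(60, events_per_second=10, seconds_per_event=0.25))

# The same load with four workers (16 events/s capacity) keeps up.
print(backlog_after(60, events_per_second=10, seconds_per_event=0.25, concurrency=4))
```

In this model the backlog only stops growing if the consumer gets faster or receives fewer events. The second option is what this request is about.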

Describe the solution you'd like

Have separate streams or subjects based on the kind / audience of events.

For example:

This would reduce event pressure on consumers that currently filter events on the client side. It would probably also reduce load on NATS itself, since events need to be distributed to fewer consumers.
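A small sketch of the expected effect, counting deliveries only. The interest table and the event mix below are assumptions made up for illustration; they are not taken from the oCIS code.

```python
from collections import Counter

# Hypothetical interest table: which kinds of events each consumer really handles.
INTERESTS = {
    "dcfs": {"upload"},
    "antivirus": {"upload"},
    "postprocessing": {"upload"},
    "sse": {"sse"},
    "audit": {"upload", "share", "sse"},
}


def deliveries(events, filtered):
    """Count how many messages each consumer receives.

    filtered=False models today's single subject: everybody gets everything.
    filtered=True models one subject per kind: a consumer only receives
    the kinds listed in its interest set.
    """
    counts = Counter()
    for kind in events:
        for consumer, kinds in INTERESTS.items():
            if not filtered or kind in kinds:
                counts[consumer] += 1
    return counts


# Made-up mix of 100 events.
events = ["upload"] * 10 + ["share"] * 30 + ["sse"] * 60

single = deliveries(events, filtered=False)
split = deliveries(events, filtered=True)

print(sum(single.values()), sum(split.values()))  # total deliveries
print(single["dcfs"], split["dcfs"])              # deliveries to dcfs
```

With this made-up mix, total deliveries drop from 500 to 190, and dcfs receives 10 messages instead of 100. The real numbers depend on which events each service actually needs.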

Describe alternatives you've considered

none

Additional context

wkloucek commented 5 months ago

You can observe the message backlog for the "dcfs" consumer group when trying what I describe in the test section of https://github.com/owncloud/ocis-charts/pull/538

kobergj commented 5 months ago

Yes. We considered this and it should not be very hard to implement. But I see some downsides with this approach.

That being said, I see one part where we could extract some events to a different queue: SSEs. The clientlog and userlog services (maybe more in the future) both send events called SendSSE. These events are only interesting for the sse service and could therefore be sent to a separate queue. However, I doubt that these few events will have a significant impact on ocis performance.

I am also a bit concerned that using multiple queues will add even more complexity to the already complex ocis configuration.

micbar commented 5 months ago

@wkloucek @kobergj Valid points.

I would suggest to work "problem oriented".

Where do we already see issues in the current implementation? How can we identify them, and would splitting up the queues make any sense?

Activity

There is an upcoming feature, #8881, which will make heavy use of the event system. Off the top of my head, I see no real difference in the consumer groups between SSE, Activity, Auditing, Userlog, Clientlog ... They seem to be interested in 90% of all events.

What would a helpful "split" look like?

wkloucek commented 5 months ago

> I would suggest to work "problem oriented".

I partly agree: this may be a problem that we do not have yet, or that we have but have not noticed yet. But I think we will run into it for sure when targeting instances with several thousand concurrent users.

I have already run into the problems a slow consumer causes (the dcfs client is set to a concurrency of 1 in the oCIS product default). You can find a reproducer in https://github.com/owncloud/ocis/issues/8949#issuecomment-2074870615

> Off the top of my head, I see no real difference in the consumer groups between SSE, Activity, Auditing, Userlog, Clientlog ... They seem to be interested in 90% of all events.

But why does the dcfs need to listen on all of this if it only needs to know which upload can be finished? Even if acking on the client side is fast, not sending the events to the consumer in the first place is faster and more efficient.

micbar commented 5 months ago

> But why does the dcfs need to listen on all of this if it only needs to know which upload can be finished? Even if acking on the client side is fast, not sending the events to the consumer in the first place is faster and more efficient.

Maybe we could think about splitting out the "filesystem" consumers like antivirus, dcfs, postprocessing.

That could split apart the "Filesystem Events" from the "General Events".

kobergj commented 4 months ago

> Maybe we could think about splitting out the "filesystem" consumers like antivirus, dcfs, postprocessing.
>
> That could split apart the "Filesystem Events" from the "General Events".

This would be an approach. But if we split away "Filesystem Events" (which probably includes sharing), there will be nothing left for the "General Events" queue. It is also unclear which queue space-related events would go to.

> But why does the dcfs need to listen on all of this if it only needs to know which upload can be finished?

dcfs does not need many events to work, that is true. But splitting these events into multiple queues will force other consumers (e.g. postprocessing) to listen to multiple different event queues. The configuration for dcfs will also become more complex, as it then needs to push its events to a different queue than the one it receives from.

wkloucek commented 4 months ago

> But splitting these events into multiple queues will force other consumers (e.g. postprocessing) to listen to multiple different event queues.

Isn't that something we can avoid by using subjects? https://docs.nats.io/nats-concepts/jetstream/streams#subjects
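A minimal sketch of the idea, written without a NATS client so it can be run standalone. NATS subjects are dot-separated tokens; `*` matches exactly one token and `>` matches one or more trailing tokens. The subject names and the filter table below are hypothetical, and `subject_matches` is only an illustration of the matching rules, not the NATS implementation.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching on dot-separated tokens.

    '*' matches exactly one token, '>' matches one or more trailing tokens.
    """
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Hypothetical filters: postprocessing keeps receiving everything with one
# wildcard filter, dcfs narrows itself down to filesystem events.
FILTERS = {
    "postprocessing": "main-queue.>",
    "dcfs": "main-queue.fs.*",
}


def receivers(subject: str) -> list:
    """Names of the consumers whose filter matches the given subject."""
    return sorted(name for name, pattern in FILTERS.items()
                  if subject_matches(pattern, subject))


print(receivers("main-queue.fs.upload-ready"))  # ['dcfs', 'postprocessing']
print(receivers("main-queue.share.created"))    # ['postprocessing']
```

In this sketch there is still one stream, and postprocessing needs only one subscription. Only the consumers that want fewer events have to change their filter.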

micbar commented 3 months ago

@kobergj @butonic @wkloucek I want to get this moving again.

We need to make a decision.

kobergj commented 3 months ago

My opinion is still the same as before:

That being said, if we want to go this way we should start by splitting "file system events" into a separate queue. But we need to decide which events are "file system events". What about ShareCreated for example?