netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Management Server is limited to 100 simultaneous peers #1824

Closed TSJasonH closed 2 months ago

TSJasonH commented 5 months ago

Describe the problem

After 100 peers have connected, the management server stops accepting additional peers.

To Reproduce

Steps to reproduce the behavior:

  1. Self-Host
  2. Permit more than 100 peers
  3. Have them all connect
  4. After 100, no further peers can connect

Expected behavior

A more realistic limit

Are you using NetBird Cloud?

Self-Hosted

NetBird version

0.27.1 (management)

Additional context

For any user beyond the limit, the management log shows these WARN entries:

{"log":"2024-04-10T12:50:02Z WARN management/server/grpcserver.go:376: failed logging in peer Ij6aLkfZU7qzUOgfzTAZMaaLNAOGsow2SDmdP+8Rxig=\n","stream":"stderr","time":"2024-04-10T12:50:02.614030981Z"}

While chatting in Slack, my colleague found this reference, which seems related. Could this be made a customizable setting that defaults to 100 if not otherwise specified?

https://github.com/netbirdio/netbird/blob/3ed2f08f3c5dd930a598a26f24cf028807816486/management/server/updatechannel.go#L13

const channelBufferSize = 100

https://github.com/netbirdio/netbird/blob/main/management/server/updatechannel.go#L83-L85

// mbragin: todo shouldn't it be more? or configurable?
channel := make(chan *UpdateMessage, channelBufferSize)
p.peerChannels[peerID] = channel

pappz commented 5 months ago

Hello! Could you set the trace log level on the server and collect the relevant part of the new logs? You can set it with the "--log-level trace" command parameter.
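For a typical docker-compose self-hosted setup, that flag can be appended to the management service's command. This is a hypothetical fragment: the service name, image, and other flags shown here may differ in your deployment; only "--log-level trace" comes from the comment above.

```yaml
# Hypothetical docker-compose fragment; adjust service/image/flags
# to match your own deployment.
services:
  management:
    image: netbirdio/management:latest
    command: ["--port", "443", "--log-level", "trace"]
```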

TSJasonH commented 5 months ago

I'd be happy to provide whatever is needed. I'm not sure exactly what the "relevant part" is. I gathered trace logs for a few minutes while this was happening, but wasn't sure which entries are most relevant to you.

Here's a clump of the log - is that useful?

management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer 4/JzpIqInXK1wdadB5F7rVi0u6n6IehsxvapPsi74kw=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer 4/JzpIqInXK1wdadB5F7rVi0u6n6IehsxvapPsi74kw=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co1e0fs11epihjcae740 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer co2264s11epihjcae7m0
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer f0qx2mNSUY4hwXaVXq7T1xz7SCCl73VEkk/xBEu0+GU=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer f0qx2mNSUY4hwXaVXq7T1xz7SCCl73VEkk/xBEu0+GU=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co63ct411epm9iuvd780 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer cmt9dtf13t6cu7gofra0
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer cy8snf31k/GUmiJmSjCqq5OHj6UxktQZuh0ah+crJzg=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer cy8snf31k/GUmiJmSjCqq5OHj6UxktQZuh0ah+crJzg=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer coak74411epmg8a5ovb0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co620lc11epm9iuvd770 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co2t5i411epihjcugt0g has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co27f1k11epihjcae870 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co1qjgs11epihjcae7ag has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co234lk11epihjcae7og has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer co2u1j411epihjcugt1g
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer gkbQFko845iP9WIyYdbYyJXq93dXsM1mCRc+HQbtMw0=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer gkbQFko845iP9WIyYdbYyJXq93dXsM1mCRc+HQbtMw0=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer co3g7ac11epm9isopqu0
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer tZ1YZ0TO8AZOLcyIQpb+FPIDOuVL74MH73TkP5pd+gs=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer tZ1YZ0TO8AZOLcyIQpb+FPIDOuVL74MH73TkP5pd+gs=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co2keo411epihjcae8n0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer cnuvfqk11epihjaflaag has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co244uk11epihjcae7q0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer cnsqd5k11epihjafla0g
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer l9Lqrb955BGiSZQqx16pHeXc9SgJo4V8n0ZQu4rlDCA=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer l9Lqrb955BGiSZQqx16pHeXc9SgJo4V8n0ZQu4rlDCA=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co7celk11epm9io1sa90 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer coakp7411epmg8a5ovbg has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer co2jsa411epihjcae8lg
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co29r3k11epihjcae8fg has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer cm3k4seabkf1d47d673g
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer k5wK0qIuOq0cdRsBPCmsnavyZkhrI0VnQuS8hEyqYis=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:196: sent an update to peer k5wK0qIuOq0cdRsBPCmsnavyZkhrI0VnQuS8hEyqYis=
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co4j9pc11epm9irg53d0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co0msis11epihjcae6r0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co22p3s11epihjcae7ng has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer cnga35713t6blui698dg has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer cnqbjs411eppa23f8ft0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:60: peer co0tbf411epihjcae6u0 has no channel
management-1  | 2024-04-10T15:38:03Z DEBG management/server/updatechannel.go:54: update was sent to channel for peer cngcn4n13t6blui698k0
management-1  | 2024-04-10T15:38:03Z DEBG management/server/grpcserver.go:180: received an update for peer qu1GjuTffbayt2hPmwDAjqtoqxriSZpxjLTVhtVQSFk=
bravosierrasierra commented 3 months ago

@pappz Same problem on 0.27.10. Changing const channelBufferSize to 1000 raises the maximum peers to 1000. This is strange, but it works.

mlsmaycon commented 3 months ago

Hello @bravosierrasierra, the buffer size is the capacity of a per-peer queue: the number of messages that can be queued for a specific peer in the netbird network.

So you can have hundreds of thousands of nodes, and each of them will have a max of 100 messages that can be queued.

It seems like you are facing another issue with the deployment which is causing these messages.

Can you share all the management logs? And confirm whether you have JWT group sync enabled?

akastav commented 3 months ago

I can confirm that I have encountered a similar problem. It is also described here: https://github.com/netbirdio/netbird/issues/1782

After adding 101 participants to one group, errors occur and one participant, at random, is not announced. Sometimes that participant is a router. I worked around it by increasing the value at https://github.com/netbirdio/netbird/blob/main/management/server/updatechannel.go#L13 to 1000, and the problem is gone.

mlsmaycon commented 3 months ago

Hi @akastav can you share if you have JWT group sync enabled?

bravosierrasierra commented 3 months ago

> Hi @akastav can you share if you have JWT group sync enabled?

yes, we use groups from JWT/Keycloak

mlsmaycon commented 3 months ago

Ok, this is probably the main issue. There is a bug on our roadmap to be fixed which causes lots of group reconfigurations. You can check that by the number of duplicated group events you have in the activity view.

bravosierrasierra commented 3 months ago

> Ok, this is probably the main issue. There is a bug on our roadmap to be fixed which causes lots of group reconfigurations. You can check that by the number of duplicated group events you have in the activity view.

But why does increasing channelBufferSize solve the problem? Does this JWT-groups bug from the roadmap have a link we can follow?

mlsmaycon commented 3 months ago

@bravosierrasierra @akastav, what are your IDP providers?

bravosierrasierra commented 3 months ago

@mlsmaycon we both use different Keycloak instances

mlsmaycon commented 3 months ago

Thanks @bravosierrasierra. Would it be possible for you to confirm the events in the activity tab? If you see duplicates, please share the decoded JWT data from one of the affected users. If you join our Slack channel, we can help you get that so you can also share it in a DM.

bravosierrasierra commented 3 months ago

We are not seeing a storm of events after increasing channelBufferSize. Just rare messages about users connecting.

mlsmaycon commented 3 months ago

We found the root cause of the issue and we are working on a fix.

mlsmaycon commented 3 months ago

Hey folks, the PR has been merged and will be in our next release.

mlsmaycon commented 2 months ago

Hey folks, have you tested? Should we close this issue?

TSJasonH commented 2 months ago

I'll be doing the upgrade this Saturday.

TSJasonH commented 2 months ago

I finished my upgrades to 0.28.4 and re-enabled JWT sync. I can confirm that I'm not seeing multiple repeating group inserts for users any longer. I believe this can be closed now. Thanks!!

TSJasonH commented 2 months ago

Back to a full working day with > 140 peers connected and the mgmt service is showing no signs of problems. (FYI, I did the migration to postgres too :-)

mlsmaycon commented 2 months ago

That's excellent; thanks, @TSJasonH, for double checking. I am closing this now.