networkservicemesh / sdk

Forwarder processes requests too slowly when there are a lot of clients #1666

Open NikitaSkrynnik opened 1 week ago

NikitaSkrynnik commented 1 week ago

Description

After some analysis we found that the forwarder processes requests too slowly. Here are the top 5 slowest places:

  1. discoverforwarder - up to 6s
  2. discoverendpoint - up to 4s
  3. roundrobin - up to 1s
  4. postpone - up to 900ms
  5. Closes in many sdk-vpp chain elements - can take up to tens of seconds

discoverforwarder and discoverendpoint

The root cause of both issues is probably the slow registry-k8s.

Issues: https://github.com/networkservicemesh/sdk-k8s/issues/512


roundrobin

Needs more investigation...


postpone

The root cause of postpone being too slow is improper use of contexts in some places. trace relies heavily on context.WithValue, and many other chain elements use this function as well.

Issues: https://github.com/networkservicemesh/sdk/issues/1665 https://github.com/networkservicemesh/sdk/issues/1667


Closes in many sdk-vpp chain elements

Clients can wait for an error from the forwarder much longer than the request timeout, because some chain elements call Close when a request fails.

Issues: https://github.com/networkservicemesh/sdk-vpp/issues/851

NikitaSkrynnik commented 1 week ago

@denis-tingaikin, @szvincze

NikitaSkrynnik commented 5 days ago

Current plan is to investigate why nsmgr is slow. It looks like the problems in forwarder and nsmgr are the same:

NikitaSkrynnik commented 5 days ago

Some statistics after rc.7 testing (40 clients, min and max request processing time, context size), local cases only:

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS       |
|---|-----------|-----------|----------|----------|--------------|--------------|
| 1 | FALSE     | INFO      | 300ms    | 9s       | 187 fields   | fields_1.txt |
| 2 | TRUE      | INFO      | 2s       | 15s      | 569 fields   | fields_2.txt |
| 3 | TRUE      | TRACE     | 10s      | 40s      | 1239 fields  | fields_3.txt |

NikitaSkrynnik commented 2 days ago

Did some analysis:

  1. These lines - consume up to 10 seconds
  2. These lines - up to 7 seconds
NikitaSkrynnik commented 2 days ago

Current plan:

NikitaSkrynnik commented 1 hour ago

The dial chain element can take up to 1s. We need to investigate what exactly affects performance: gRPC or the unix socket.

Current plan: