Open NikitaSkrynnik opened 1 week ago
@denis-tingaikin, @szvincze
Current plan is to investigate why `nsmgr` is slow. It looks like the problems in forwarder and `nsmgr` are the same:
nsmgr
nsmgr
without traces
Some statistics after rc.7 testing (40 clients, min and max request processing time, context size), only local cases:

| # | TELEMETRY | LOG LEVEL | TIME MIN | TIME MAX | CONTEXT SIZE | FIELDS |
|---|---|---|---|---|---|---|
| 1 | FALSE | INFO | 300ms | 9s | 187 fields | fields_1.txt |
| 2 | TRUE | INFO | 2s | 15s | 569 fields | fields_2.txt |
| 3 | TRUE | TRACE | 10s | 40s | 1239 fields | fields_3.txt |

Did some analysis:
Current plan:
`dial` chain element can consume up to 1s. We need to investigate what exactly affects the performance: gRPC or the unix socket.
Current plan:
`find` requests and count avg `dial` time
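One way to separate the unix socket cost from the gRPC cost is to time plain socket dials with no gRPC involved. The sketch below is only an illustration under that assumption: `avgUnixDial` and the temporary socket path are hypothetical names, and it uses nothing but the Go standard library, so it measures the socket-only share of a dial and says nothing about the gRPC handshake.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// avgUnixDial (hypothetical helper) starts a throwaway unix-socket listener,
// dials it n times and returns the mean duration of a net.Dial call.
func avgUnixDial(n int) (time.Duration, error) {
	if n <= 0 {
		return 0, fmt.Errorf("n must be positive, got %d", n)
	}

	dir, err := os.MkdirTemp("", "dial-bench")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "bench.sock")
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	// Accept and immediately close every incoming connection.
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			conn.Close()
		}
	}()

	var total time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		conn, err := net.Dial("unix", sock)
		if err != nil {
			return 0, err
		}
		total += time.Since(start) // only the dial itself is timed
		conn.Close()
	}
	return total / time.Duration(n), nil
}

func main() {
	avg, err := avgUnixDial(1000)
	if err != nil {
		fmt.Println("dial benchmark failed:", err)
		return
	}
	fmt.Println("average unix socket dial:", avg)
}
```

If the average here is far below the time observed in the `dial` chain element, the remaining time is more likely spent above the socket layer.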
Description
After some analysis we found out that the forwarder processes requests too slowly. Here are the top 5 places:
- `discoverforwarder` - up to 6s
- `discoverendpoint` - up to 4s
- `roundrobin` - up to 1s
- `postpone` - up to 900ms
- `sdk-vpp` chain elements - can take up to tens of seconds

`discoverforwarder` and `discoverendpoint`
The root cause of these issues is probably slow `registry-k8s`.
Issues: https://github.com/networkservicemesh/sdk-k8s/issues/512
`roundrobin`
Needs more investigation...
`postpone`
The root cause of `postpone` being too slow is improper use of contexts in some places. `trace` relies heavily on `context.WithValue`. Also, a lot of other chain elements use this function.
Issues:
https://github.com/networkservicemesh/sdk/issues/1665
https://github.com/networkservicemesh/sdk/issues/1667
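To illustrate why heavy use of `context.WithValue` can get expensive: every call wraps the parent context, and `Value` walks that chain from the newest wrapper back to the oldest, so a lookup costs time proportional to the number of stored values. The sketch below compares that with keeping all values in one map under a single key. All names (`nested`, `flat`, `lookupNested`, `lookupFlat`) are hypothetical, the sizes are borrowed from the table above purely as sample inputs, and the single-map variant is just one possible alternative, not the fix adopted in the linked issues.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type key int

type fieldsKey struct{}

// nested adds n values with one context.WithValue call each,
// producing a chain of n wrapped contexts.
func nested(n int) context.Context {
	ctx := context.Background()
	for i := 0; i < n; i++ {
		ctx = context.WithValue(ctx, key(i), i)
	}
	return ctx
}

// flat stores the same n values in one map under a single context key.
func flat(n int) context.Context {
	fields := make(map[key]int, n)
	for i := 0; i < n; i++ {
		fields[key(i)] = i
	}
	return context.WithValue(context.Background(), fieldsKey{}, fields)
}

// lookupNested walks the wrapper chain until it finds k.
func lookupNested(ctx context.Context, k key) (int, bool) {
	v, ok := ctx.Value(k).(int)
	return v, ok
}

// lookupFlat does one context lookup followed by one map lookup.
func lookupFlat(ctx context.Context, k key) (int, bool) {
	fields, ok := ctx.Value(fieldsKey{}).(map[key]int)
	if !ok {
		return 0, false
	}
	v, ok := fields[k]
	return v, ok
}

func main() {
	const rounds = 100000
	for _, n := range []int{187, 569, 1239} {
		nestedCtx, flatCtx := nested(n), flat(n)

		// key(0) was added first, so it sits at the far end of the chain.
		start := time.Now()
		for i := 0; i < rounds; i++ {
			lookupNested(nestedCtx, key(0))
		}
		nestedTime := time.Since(start)

		start = time.Now()
		for i := 0; i < rounds; i++ {
			lookupFlat(flatCtx, key(0))
		}
		flatTime := time.Since(start)

		fmt.Printf("%4d values: nested %v, flat %v\n", n, nestedTime, flatTime)
	}
}
```

The absolute numbers depend on the machine; the point of the comparison is only how the nested lookup time grows with the number of values.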
Closes in many `sdk-vpp` chain elements
Clients can wait for the error from a forwarder for much longer than the request timeout because some chain elements call `close` if `request` fails.
Issues: https://github.com/networkservicemesh/sdk-vpp/issues/851