Describe the issue
I have been running the nym-gateway for a while as part of the shipyard program. Most of the time it has two socks-clients attached, which are not reachable by anyone except me.
Puzzled by some strange-looking OS-level metrics, I found that the number of context switches of the gateway process is suspiciously high.
The process had been running for roughly six days at that point. I then took hourly snapshots of the context-switch counters for two weeks to see how they evolve over time. They show that the values are very high on average but at least constant over time.
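For reference, here is a minimal sketch of one way to take such snapshots (not necessarily the exact tooling behind the numbers above): it simply re-reads the voluntary/nonvoluntary counters from /proc/<pid>/status once an hour, assuming the gateway PID is passed as the first argument.

```python
#!/usr/bin/env python3
"""Append hourly snapshots of the gateway's context-switch counters to stdout."""
import sys
import time

def ctxt_switches(pid: int) -> tuple[int, int]:
    """Read the two counters from /proc/<pid>/status."""
    voluntary = nonvoluntary = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                voluntary = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvoluntary = int(line.split()[1])
    return voluntary, nonvoluntary

if __name__ == "__main__":
    pid = int(sys.argv[1])          # gateway PID
    while True:
        vol, nonvol = ctxt_switches(pid)
        print(f"{int(time.time())},{vol},{nonvol}", flush=True)
        time.sleep(3600)            # one snapshot per hour
```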
Digging further into that, I found that the process produces a huge number of syscalls, which is most likely the reason for the high number of context switches (as syscalls require context switches).
Here are (roughly) ten seconds of tracing the number of syscalls per thread (some omitted). The last two ("sqlx-sqlite-wor" and "tokio-runtime-w") are the threads spawned by nym-gateway:
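To get comparable per-thread counts, one option is to attach a counting strace instance to every thread of the gateway for ten seconds; a rough Python sketch of that idea follows (ptrace-based tracing is intrusive and will slow the gateway down while it runs, so the absolute numbers should be taken with a grain of salt).

```python
#!/usr/bin/env python3
"""Attach a counting strace to every thread of the gateway for ~10 s."""
import os
import signal
import subprocess
import sys
import time

pid = int(sys.argv[1])
tracers = []
for tid in sorted(int(t) for t in os.listdir(f"/proc/{pid}/task")):
    # Thread name, e.g. "tokio-runtime-w" or "sqlx-sqlite-wor"
    with open(f"/proc/{pid}/task/{tid}/comm") as f:
        comm = f.read().strip()
    out = f"syscalls_{tid}_{comm}.txt"
    # -c: only count syscalls, -p: attach to this thread, -o: summary file
    tracers.append(subprocess.Popen(["strace", "-c", "-p", str(tid), "-o", out]))

time.sleep(10)

for p in tracers:
    p.send_signal(signal.SIGINT)  # detach; strace writes its summary table
    p.wait()
```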
So that is more than 6k syscalls per second for the "nym threads" combined, which is a lot, though not critically high. Nevertheless, I cannot see how that squares with the fact that the gateway is mostly idling; the main thing it does is presumably generating cover traffic. How can that be so heavy on syscalls?
Sidenote: the gateway (while being mostly idle) produces constant network traffic of roughly 5 Mbps rx + 4.5 Mbps tx = 9.5 Mbps. This splits up into (roughly) 60% for port 1789 and 40% for port 9000. That seems like a lot for "cover traffic".
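The per-port split can be checked with any packet sniffer; below is a rough stdlib-only Python sketch (Linux only, needs root, assumes plain IPv4/TCP over Ethernet-framed interfaces) that attributes captured bytes to ports 1789 and 9000 over a ten-second window.

```python
#!/usr/bin/env python3
"""Rough rx+tx byte split between TCP ports 1789 and 9000 over ten seconds."""
import socket
import struct
import time
from collections import Counter

PORTS = {1789, 9000}
WINDOW = 10  # seconds

# ETH_P_ALL (0x0003): capture every frame on every interface; needs root
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
sock.settimeout(1.0)

bytes_per_port = Counter()
deadline = time.time() + WINDOW
while time.time() < deadline:
    try:
        frame, _ = sock.recvfrom(65535)
    except socket.timeout:
        continue
    if len(frame) < 34 or struct.unpack("!H", frame[12:14])[0] != 0x0800:
        continue                                  # IPv4 over Ethernet only
    ihl = (frame[14] & 0x0F) * 4                  # IP header length
    if frame[14 + 9] != 6 or len(frame) < 14 + ihl + 4:
        continue                                  # TCP only
    total_len = struct.unpack("!H", frame[16:18])[0]
    sport, dport = struct.unpack("!HH", frame[14 + ihl:14 + ihl + 4])
    for port in PORTS & {sport, dport}:
        bytes_per_port[port] += total_len         # count the whole IP packet

for port, n in sorted(bytes_per_port.items()):
    print(f"port {port}: {n * 8 / WINDOW / 1e6:.2f} Mbit/s")
```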
Next I sampled (roughly) ten seconds to count the syscalls grouped by syscall name (some omitted):
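A simple way to get such a per-syscall summary is strace's counting mode attached to the whole process for ten seconds; a small Python wrapper as a sketch:

```python
#!/usr/bin/env python3
"""Ten-second per-syscall summary of the whole gateway process via strace."""
import signal
import subprocess
import sys
import time

pid = sys.argv[1]
# -c: count syscalls, -f: include all threads of the attached process
proc = subprocess.Popen(["strace", "-c", "-f", "-p", pid])
time.sleep(10)
proc.send_signal(signal.SIGINT)  # detach; the summary table goes to stderr
proc.wait()
```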
So this looks like a lot of read/write/lock activity. It matches the OS-level metrics, which show that the process (with two clients attached most of the time) is constantly performing about 300 write ops/s, something I also cannot see a plausible reason for:
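The write rate can also be confirmed without any tracing overhead by sampling the kernel's per-process I/O accounting; a sketch (PID as the first argument; note that syscw counts write-style syscalls, while write_bytes is what actually reaches the block layer):

```python
#!/usr/bin/env python3
"""Write/read syscall rates of the gateway from /proc/<pid>/io."""
import sys
import time

def io_counters(pid: int) -> dict:
    with open(f"/proc/{pid}/io") as f:
        return {k: int(v) for k, v in (line.split(": ") for line in f)}

pid = int(sys.argv[1])
interval = 10
before = io_counters(pid)
time.sleep(interval)
after = io_counters(pid)

# syscr/syscw are read/write-style syscalls; read_bytes/write_bytes is what
# actually reaches the block layer (i.e. the disk side of the story)
for field in ("syscr", "syscw", "read_bytes", "write_bytes"):
    print(f"{field}: {(after[field] - before[field]) / interval:.0f}/s")
```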
Additionally, the amount of CPU system time consumed relative to user time is very high, which is a plausible symptom of the high syscall rate:
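The user/system split can be read straight from /proc/<pid>/stat (utime and stime, fields 14 and 15); a sketch that samples the two counters over a minute:

```python
#!/usr/bin/env python3
"""Sample the gateway's user vs. system CPU time from /proc/<pid>/stat."""
import os
import sys
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")

def cpu_times(pid: int) -> tuple[float, float]:
    with open(f"/proc/{pid}/stat") as f:
        # comm is wrapped in parentheses and may contain spaces,
        # so split after the closing ')' and count fields from there
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) / CLK_TCK, int(fields[12]) / CLK_TCK  # utime, stime

pid = int(sys.argv[1])
interval = 60
u1, s1 = cpu_times(pid)
time.sleep(interval)
u2, s2 = cpu_times(pid)
print(f"user: {(u2 - u1) / interval:.1%} of one core, "
      f"system: {(s2 - s1) / interval:.1%} of one core")
```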
From some experiments with attaching an additional (i.e. a third) client to the gateway, I would further estimate that each connected client adds roughly another 150 write ops/s:
Taken together, these observations suggest that a gateway will not scale beyond a few (<20 or so) connected clients. A load test (i.e. attaching a lot of clients to a single gateway) should be done to validate this. A workable way to do that might be to use the members of the current grantees group.
Expected behaviour
Given the network structure, a gateway should (imho) be capable of serving at least a few hundred clients, and being able to serve a few thousand clients would not be an exaggerated scenario either.
I'd think that in order to reach reasonable efficiency, the number of syscalls as well as the number of disk writes will have to be brought down severely (I'm thinking 1-2 orders of magnitude). They might be caused by the libraries or frameworks currently in use.
Steps to Reproduce
Which area of Nym were you using?