kubernetes-csi / csi-proxy

CSI Proxy utility to enable CSI Plugins on Windows
Apache License 2.0
58 stars 59 forks source link

High cpu usage of powershell processes triggered by csi-proxy #193

Open dazhiw opened 2 years ago

dazhiw commented 2 years ago

What happened: We are doing some performance tests for vsphere-csi-driver (https://github.com/kubernetes-sigs/vsphere-csi-driver) on k8s cluster with 20 windows worker nodes. Each node has 4 CPUs and 16GB of memory. k8s version is 1.22.3, windows version is Windows Server 2019. the vsphere csi node plugin uses csi-proxy for privileged operations on host devices. The test repeatedly issues CreatePod followed by DeletePod. Each Pod just starts one container with image "mcr.microsoft.com/oss/kubernetes/pause:1.4.1", and mounts one persistent volume.

During the test we observed high cpu usage on the windows worker nodes, and most of the cpu usage came from the powershell processes triggered by csi-proxy. With 1.4 Pod creations per second, the average cpu usage on each node is about 75% of the total cpu capacity, the CPU cost of powershell processes triggered by csi-proxy is about 45% of the total CPU capacity; With 2.4 Pod creations per second, the cpu usage on each node reached almost 100%, and the CPU cost of powershell processes triggered by csi-proxy is about 68%. As a comparison, for vsphere-csi-driver on Linux nodes, with 3 Pod creations per second the cpu usage on each node is constantly below 10%.

What you expected to happen: Reduce the cpu cost of csi-proxy.

How to reproduce it: deploy vsphere-csi-driver on k8s cluster with windows nodes, and create Pod with PV mount, then delete Pod.

Anything else we need to know?:

Environment:

divyenpatel commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@divyenpatel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-1171533555): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
divyenpatel commented 2 years ago

@dazhiw

@pradeep-hegde change is partially fixing the issue and not getting much-needed performance improvement. Can we keep this issue open?

k8s-ci-robot commented 2 years ago

@wdazhi2020: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-1171578642): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dazhiw commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@dazhiw: Reopened this issue.

In response to [this](https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-1171581137): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
msau42 commented 2 years ago

@divyenpatel did we compare with intree Windows? It is known that Windows consumes more resources than Linux. The question is do we think there's a regression between intree vs csi Windows?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mauriciopoppe commented 2 years ago

/remove-lifecycle stale

mauriciopoppe commented 1 year ago

@alexander-ding please post the results of your experiments here

alexander-ding commented 1 year ago

@alexander-ding please post the results of your experiments here

The experiments I ran were not related to resource usage unfortunately. I was measuring latency and throughput for CSI Proxy. The resource usage related experiments were on Linux nodes.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-1515738577): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mauriciopoppe commented 6 months ago

/remove-lifecycle rotten /lifecycle frozen

knabben commented 6 months ago

Circling this back.. tracked from a lot of calls to Get-Item -Path $Env:mountpath).Target that seems to hang in process explorer, even when all the pods with PVCs are deleted.

I0416 10:23:52.948472    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\83a8a951-2b74-4cee-bafc-b6cb6790ddbd\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.948937    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\0d746ae0-6a44-4b78-9a5b-d05fa146fd06\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.953077    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:52.955659    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.322869    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.483584    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.486178    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.519391    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:55.167002    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:56.050764    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:57.028808    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:58.569111    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.037196    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.052682    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.142338    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.402459    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.629165    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:24:00.037973    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"

Dumping the stacktrace it shows a very odd call to the same function, what is very different from the normal behavior.

runtime/debug.Stack()                                                                                                                                                                                                                                                   [117/1942]
        /home/aknabben/.asdf/installs/golang/1.22.1/go/src/runtime/debug/stack.go:24 +0x5e
github.com/kubernetes-csi/csi-proxy/pkg/utils.RunPowershellCmd({0x1756b3d, 0x26}, {0xc0002e3c40, 0x1, 0x1?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/utils/utils.go:17 +0xee
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eb200?, 0xc000406400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:283 +0xa5
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eaf00?, 0xc0000d1c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218c00?, 0xc0001de400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eac00?, 0xc0000d1400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0003ca300?, 0xc00007d000?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218900?, 0xc000117c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000ea900?, 0xc0000d0c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218600?, 0xc000117400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
...
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.VolumeAPI.GetVolumeIDFromTargetPath({}, {0xc0000f3290, 0x85})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:271 +0x25
github.com/kubernetes-csi/csi-proxy/pkg/server/volume.(*Server).GetVolumeIDFromTargetPath(0xc0003b80a0, {0x0?, 0xc000161008?}, 0xc000415360, {0x1, 0x2, 0x0, {0x173b627, 0x2}})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/server.go:281 +0x133
github.com/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1.(*versionedAPI).GetVolumeIDFromTargetPath(0xc0003b81c0, {0x180b870, 0xc0004f45d0}, 0xc0005c0c40)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1/server_generated.go:74 +0xbc
github.com/kubernetes-csi/csi-proxy/client/api/volume/v1._Volume_GetVolumeIDFromTargetPath_Handler({0x170c140, 0xc0003b81c0}, {0x180b870, 0xc0004f45d0}, 0xc00011ee00, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/github.com/kubernetes-csi/csi-proxy/client/api/volume/v1/api.pb.go:1799 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003f1a40, {0x180b870, 0xc0004f4420}, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0, 0xc0003b34d0, 0x1c09520, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1343 +0xdd1
google.golang.org/grpc.(*Server).handleStream(0xc0003f1a40, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1737 +0xc47
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 122
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:997 +0x136
andyzhangx commented 5 months ago

for VolumeStats, we could use windows.GetDiskFreeSpaceEx api instead, that could avoid such issue, check the refined PR here: https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/2287

randomvariable commented 1 week ago

The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

mauriciopoppe commented 1 week ago

Moving from powershell to win32 API would be a nice change at the expense of making the code harder to read e.g. https://github.com/kubernetes-csi/csi-proxy/blob/07be14dabebca5d1590e670f9a796fcfce44d6e6/pkg/os/disk/api.go#L190 is harder to read than using a Powershell command, for this case I think there wasn't a powershell command so that's why the team implemented it with the Win32 API directly.

For the current issue, @knabben did a great analysis in https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-2059665463 about the current implementation, and also opened a PR fixing it in https://github.com/kubernetes-csi/csi-proxy/pull/336 but we haven't been able to merge it because presubmit tests are broken because of a Windows module that was disabled in Github Actions. I was trying to fix it in https://github.com/kubernetes-csi/csi-proxy/pull/348 but didn't have luck.

For the next steps, the plan of action was to:

The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

Changes would be blocked by the presubmit error, we should try to fix the SMB tests first.

randomvariable commented 1 week ago

If we're not keen on using Win32 API directly because of readability, then we should strongly consider using the MSFT_Disk & Win32_DiskPartition WMI classes, which is what PowerShell is wrapping, but without the .NET overhead.

laozc commented 1 week ago

You may check how it would be like to switch to use WMI classes from this WIP branch. https://github.com/laozc/csi-proxy/commit/b7483ac6f277d04f5c6d401be88a4d5ea3ec9340