dazhiw opened this issue 2 years ago
/reopen
@divyenpatel: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@dazhiw
@pradeep-hegde's change partially fixes the issue but does not deliver the much-needed performance improvement. Can we keep this issue open?
@wdazhi2020: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@dazhiw: Reopened this issue.
@divyenpatel did we compare with in-tree on Windows? It is known that Windows consumes more resources than Linux. The question is whether we think there's a regression between in-tree and CSI on Windows.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@alexander-ding please post the results of your experiments here
Unfortunately, the experiments I ran were not related to resource usage. I was measuring latency and throughput for CSI Proxy; the resource-usage-related experiments were on Linux nodes.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/remove-lifecycle rotten
/lifecycle frozen
Circling back to this: I tracked it down to a large number of calls to (Get-Item -Path $Env:mountpath).Target
that appear to hang in Process Explorer, even when all the pods with PVCs are deleted.
I0416 10:23:52.948472 9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\83a8a951-2b74-4cee-bafc-b6cb6790ddbd\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.948937 9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\0d746ae0-6a44-4b78-9a5b-d05fa146fd06\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.953077 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:52.955659 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.322869 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.483584 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.486178 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.519391 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:55.167002 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:56.050764 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:57.028808 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:58.569111 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.037196 9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.052682 9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.142338 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.402459 9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.629165 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:24:00.037973 9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
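Each of those log lines corresponds to a fresh powershell.exe process. As a rough sketch of the pattern (the helper name runPowershell and the environment-variable plumbing below are illustrative, not the actual csi-proxy code), every target resolution pays the cost of a full PowerShell startup:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runPowershell mirrors the pattern visible in the log above: each call spawns
// a brand-new powershell.exe process, which is where the CPU cost comes from.
func runPowershell(command string, env ...string) (string, error) {
	cmd := exec.Command("powershell.exe", "-Mta", "-NoProfile", "-Command", command)
	cmd.Env = append(os.Environ(), env...)
	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	// Resolving a single mount point requires a full PowerShell process;
	// the path below is a placeholder for a real kubelet mount path.
	out, err := runPowershell(`(Get-Item -Path $Env:mountpath).Target`,
		`mountpath=c:\var\lib\kubelet\pods\<pod-uid>\volumes\kubernetes.io~csi\<pvc-name>\mount`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```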
Dumping the stack trace shows a very odd recursive call into the same function, which is very different from the normal behavior.
runtime/debug.Stack()
/home/aknabben/.asdf/installs/golang/1.22.1/go/src/runtime/debug/stack.go:24 +0x5e
github.com/kubernetes-csi/csi-proxy/pkg/utils.RunPowershellCmd({0x1756b3d, 0x26}, {0xc0002e3c40, 0x1, 0x1?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/utils/utils.go:17 +0xee
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eb200?, 0xc000406400?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:283 +0xa5
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eaf00?, 0xc0000d1c00?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218c00?, 0xc0001de400?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eac00?, 0xc0000d1400?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0003ca300?, 0xc00007d000?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218900?, 0xc000117c00?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000ea900?, 0xc0000d0c00?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218600?, 0xc000117400?})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
...
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.VolumeAPI.GetVolumeIDFromTargetPath({}, {0xc0000f3290, 0x85})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:271 +0x25
github.com/kubernetes-csi/csi-proxy/pkg/server/volume.(*Server).GetVolumeIDFromTargetPath(0xc0003b80a0, {0x0?, 0xc000161008?}, 0xc000415360, {0x1, 0x2, 0x0, {0x173b627, 0x2}})
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/server.go:281 +0x133
github.com/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1.(*versionedAPI).GetVolumeIDFromTargetPath(0xc0003b81c0, {0x180b870, 0xc0004f45d0}, 0xc0005c0c40)
/home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1/server_generated.go:74 +0xbc
github.com/kubernetes-csi/csi-proxy/client/api/volume/v1._Volume_GetVolumeIDFromTargetPath_Handler({0x170c140, 0xc0003b81c0}, {0x180b870, 0xc0004f45d0}, 0xc00011ee00, 0x0)
/home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/github.com/kubernetes-csi/csi-proxy/client/api/volume/v1/api.pb.go:1799 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003f1a40, {0x180b870, 0xc0004f4420}, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0, 0xc0003b34d0, 0x1c09520, 0x0)
/home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1343 +0xdd1
google.golang.org/grpc.(*Server).handleStream(0xc0003f1a40, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0)
/home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1737 +0xc47
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 122
/home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:997 +0x136
For VolumeStats, we could use the windows.GetDiskFreeSpaceEx API instead, which would avoid this issue; see the PR here:
https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/2287
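As a sketch of what that could look like (assuming the golang.org/x/sys/windows package; the function name getVolumeStats and the error handling are illustrative, not the actual driver code):

```go
//go:build windows

package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

// getVolumeStats queries capacity/usage for the volume backing a mount path
// via GetDiskFreeSpaceExW instead of shelling out to PowerShell.
func getVolumeStats(path string) (capacity, used uint64, err error) {
	p, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return 0, 0, err
	}
	var freeToCaller, total, totalFree uint64
	if err := windows.GetDiskFreeSpaceEx(p, &freeToCaller, &total, &totalFree); err != nil {
		return 0, 0, err
	}
	return total, total - totalFree, nil
}

func main() {
	// Hypothetical mount path used only for this example.
	capacity, used, err := getVolumeStats(`C:\var\lib\kubelet`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("Capacity %d Used %d\n", capacity, used)
}
```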
The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?
Moving from PowerShell to the Win32 API would be a nice change, at the expense of making the code harder to read, e.g. https://github.com/kubernetes-csi/csi-proxy/blob/07be14dabebca5d1590e670f9a796fcfce44d6e6/pkg/os/disk/api.go#L190 is harder to read than a PowerShell command. In that case I think there was no equivalent PowerShell command, which is why the team implemented it with the Win32 API directly.
For the current issue, @knabben did a great analysis of the current implementation in https://github.com/kubernetes-csi/csi-proxy/issues/193#issuecomment-2059665463 and also opened a PR fixing it in https://github.com/kubernetes-csi/csi-proxy/pull/336, but we haven't been able to merge it because the presubmit tests are broken due to a Windows module that was disabled in GitHub Actions. I tried to fix that in https://github.com/kubernetes-csi/csi-proxy/pull/348 but didn't have luck.
For the next steps, the plan of action was to:
> The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

Changes would be blocked by the presubmit error; we should try to fix the SMB tests first.
If we're not keen on using the Win32 API directly because of readability, then we should strongly consider using the MSFT_Disk & Win32_DiskPartition WMI classes, which are what PowerShell is wrapping, but without the .NET overhead.
You may check what it would be like to switch to WMI classes in this WIP branch: https://github.com/laozc/csi-proxy/commit/b7483ac6f277d04f5c6d401be88a4d5ea3ec9340
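For illustration, querying MSFT_Disk from Go without spawning powershell.exe could look roughly like the sketch below (assuming the github.com/StackExchange/wmi client and the root\Microsoft\Windows\Storage namespace; the WIP branch above may use a different WMI library):

```go
//go:build windows

package main

import (
	"fmt"

	"github.com/StackExchange/wmi"
)

// msftDisk maps a subset of the MSFT_Disk WMI class; field names must match
// the WMI property names so the query results can be unmarshalled.
type msftDisk struct {
	Number uint32
	Size   uint64
}

// listDisks is roughly the equivalent of `Get-Disk`, but via WMI directly,
// with no PowerShell process involved.
func listDisks() ([]msftDisk, error) {
	var disks []msftDisk
	err := wmi.QueryNamespace(
		"SELECT Number, Size FROM MSFT_Disk",
		&disks,
		`root\Microsoft\Windows\Storage`,
	)
	return disks, err
}

func main() {
	disks, err := listDisks()
	if err != nil {
		panic(err)
	}
	for _, d := range disks {
		fmt.Printf("disk %d size=%d\n", d.Number, d.Size)
	}
}
```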
What happened: We are running performance tests for the vsphere-csi-driver (https://github.com/kubernetes-sigs/vsphere-csi-driver) on a k8s cluster with 20 Windows worker nodes. Each node has 4 CPUs and 16GB of memory. The k8s version is 1.22.3 and the Windows version is Windows Server 2019. The vSphere CSI node plugin uses csi-proxy for privileged operations on host devices. The test repeatedly issues CreatePod followed by DeletePod. Each Pod starts a single container with the image "mcr.microsoft.com/oss/kubernetes/pause:1.4.1" and mounts one persistent volume.
During the test we observed high CPU usage on the Windows worker nodes, and most of it came from the powershell processes triggered by csi-proxy. At 1.4 Pod creations per second, the average CPU usage on each node is about 75% of total CPU capacity, with the powershell processes triggered by csi-proxy accounting for about 45%. At 2.4 Pod creations per second, CPU usage on each node reaches almost 100%, with the powershell processes triggered by csi-proxy accounting for about 68%. As a comparison, for vsphere-csi-driver on Linux nodes, at 3 Pod creations per second the CPU usage on each node stays constantly below 10%.
What you expected to happen: Reduce the CPU cost of csi-proxy.
How to reproduce it: Deploy vsphere-csi-driver on a k8s cluster with Windows nodes, create a Pod with a PV mount, then delete the Pod.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.22.3
- Kernel (e.g. uname -a):