nomaster opened this issue 1 year ago
A timeout is expected at some point, since Velero can't wait forever. As for the TLS handshake timeout, you may have an environment issue: make sure you unset any http_proxy and https_proxy. If that doesn't help, it sometimes comes down to a resource issue; you could check whether all kinds of resources (e.g. memory, network) on the node are sufficient.
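To rule out a stray proxy, you could inspect the Velero deployment's environment and clear any proxy variables. This is only a sketch; it assumes the default `velero` namespace and deployment name:

```shell
# Show any proxy-related environment variables inside the Velero pod
kubectl -n velero exec deploy/velero -- env | grep -i proxy

# Remove them from the deployment if present
# (a trailing '-' after the variable name unsets it)
kubectl -n velero set env deployment/velero \
  HTTP_PROXY- HTTPS_PROXY- NO_PROXY- http_proxy- https_proxy- no_proxy-
```

The `set env` command triggers a rolling restart of the deployment, so the new pod starts without the proxy settings.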
Yes, of course every operation needs to time out at some point. What I'm missing is that Velero doesn't try again when this happens, or when any other error occurs.
I'm not using a proxy. Resources should be sufficient: I had OOM kills in the past, but they vanished after I configured a memory request of 512 MiB for the pod.
@nomaster, did your issue get resolved? I have the same issue. Here is the error from today:
time="2023-02-01T14:00:30Z" level=debug msg="Error from backupItemActionResolver.ResolveActions" backup=velero/velero-astro-daily-20230201140019 error="rpc error: code = Unknown desc = Get \"https://xx.x.x.x:443/api\": net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/backup.go:219" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*kubernetesBackupper).BackupWithResolvers" logSource="pkg/backup/backup.go:219"
Can someone help on this?
> @nomaster, did your issue get resolved? I have the same issue. Here is the error from today.
Unfortunately not. The error still comes up for me every few days.
We need some kind of dynamic timeout setting so we can control how long Velero waits before timing out.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm still seeing this issue
I'm also having this problem
time="2023-05-22T02:00:44Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = Get \"https://10.0.0.1:443/api?timeout=32s\": net/http: TLS handshake timeout" key=velero/generalbackup01backup01-20230522020033 logSource="pkg/controller/backup_controller.go:282"
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
I haven't seen this issue anymore. Maybe a fix has been included in the recent releases?
Anyone else?
Still seeing this issue (but rarely). AKS 1.26.3, Velero 1.11.0, Azure Plugin 1.7.0
Facing the same error regularly with AKS v1.27.3, Velero v1.11.1, Azure Plugin v1.7.1. We are using two schedules at the same time; one always succeeds.
Hi,
we are seeing the handshake problem very regularly on our hourly cron jobs:
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-daily-backup-20231104070024 Completed 0 0 2023-11-04 08:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104060024 Completed 0 0 2023-11-04 07:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104050024 Failed 0 0 2023-11-04 06:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104040024 Completed 0 0 2023-11-04 05:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104030024 Failed 0 0 2023-11-04 04:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104020024 Failed 0 0 2023-11-04 03:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104010024 Failed 0 0 2023-11-04 02:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231104000024 Completed 0 0 2023-11-04 01:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103230024 Completed 0 0 2023-11-04 00:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103220024 Completed 0 0 2023-11-03 23:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103210024 Failed 0 0 2023-11-03 22:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103200024 Completed 0 0 2023-11-03 21:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103190024 Completed 0 0 2023-11-03 20:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103180024 Completed 0 0 2023-11-03 19:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103170024 Completed 0 0 2023-11-03 18:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103160024 Completed 0 0 2023-11-03 17:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103150024 Completed 0 0 2023-11-03 16:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103140024 Failed 0 0 2023-11-03 15:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103130024 Failed 0 0 2023-11-03 14:00:24 +0100 CET 2d default <none>
velero-daily-backup-20231103120624 Completed 0 0 2023-11-03 13:06:24 +0100 CET 2d default <none>
time="2023-11-04T03:00:36Z" level=error msg="backup failed" backuprequest=velero/velero-daily-backup-20231104030024 controller=backup error="rpc error: code = Unknown desc = Get \"https://10.0.0.1:443/api\": net/http: TLS handshake timeout" logSource="pkg/controller/backup_controller.go:290"
time="2023-11-04T02:00:35Z" level=error msg="backup failed" backuprequest=velero/velero-daily-backup-20231104020024 controller=backup error="rpc error: code = Unknown desc = Get \"https://10.0.0.1:443/api\": net/http: TLS handshake timeout" logSource="pkg/controller/backup_controller.go:290"
AKS 1.26.3 / 1.27.3, Velero 1.11.0, Azure Plugin 1.8.1
Also seeing this issue regularly and couldn't pinpoint a culprit so far.
AKS 1.26.3 / 1.27.3, Velero 1.11.1, Azure Plugin 1.8.1
Out of curiosity:
We are running a couple of AKS clusters, all experiencing the same thing: regularly failed backups due to timeouts. Changing the schedules to pin each cluster to a separate time frame has not solved the issue so far. We are wondering whether it would be possible to integrate a retry within Velero for such cases.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
Still the same problem with AKS v1.27.3, Velero v1.12.3, and velero-plugin-for-microsoft-azure v1.8.2.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
At KubeCon Paris, some Azure folks told us that this is an issue in the network layer of AKS. It's hard to debug, but they are working on a fix.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
not stale
What steps did you take and what happened:
Backup tasks sometimes fail with "TLS handshake timeout" when trying to reach the Kubernetes API server.
What did you expect to happen:
Velero should wait for the Kubernetes API server to be reachable again.
The following information will help us better understand what's going on:
log output:
Anything else you would like to add:
Maybe there is a retry loop missing?
Environment:
Velero version (use velero version): 1.10.0
Velero features (use velero client config get features): EnableCSI
Kubernetes version (use kubectl version): 1.24.6
OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS

Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.