vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.79k stars 1.41k forks source link

Velero server panic if plugin print log with "WithError" #5972

Closed Lyndon-Li closed 1 year ago

Lyndon-Li commented 1 year ago

As the [logrus code](https://github.com/sirupsen/logrus/blob/v1.8.1/json_formatter.go#:~:text=logrus/issues/137-,data,(),-default), the error field will be converted to a string if the log format is json:

func (f *JSONFormatter) Format(entry *Entry) ([]byte, error) {
    data := make(Fields, len(entry.Data)+4)
    for k, v := range entry.Data {
        switch v := v.(type) {
        case error:
            // Otherwise errors are ignored by `encoding/json`
            // https://github.com/sirupsen/logrus/issues/137
            data[k] = v.Error()
        default:
            data[k] = v
        }
    }

When the plugin generate log with logger.WithError, the plugin client gets the error field as string instead of error interface.

The backup controller registers a log_counter_hook to the backup logger, while in log_counter_hook.go, the assumes the error field as error interface only:

    if isErrorFieldPresent {
        entryMessage = fmt.Sprintf("%s error: /%s", entryMessage, errorField.(error).Error())
    }

As a result, there will be a panic like below:

panic: interface conversion: string is not error: missing method Error

goroutine 1079 [running]:
github.com/vmware-tanzu/velero/pkg/util/logging.(*LogHook).Fire(0xc000571230, 0xc0003f2620)
    /go/src/github.com/vmware-tanzu/velero/pkg/util/logging/log_counter_hook.go:74 +0x3f3
github.com/sirupsen/logrus.LevelHooks.Fire(0xc000a53af0?, 0xa53ac0?, 0xc000000005?)
    /go/pkg/mod/github.com/sirupsen/logrus@v1.8.1/hooks.go:28 +0x82
github.com/sirupsen/logrus.(*Entry).fireHooks(0xc0003f2620)
    /go/pkg/mod/github.com/sirupsen/logrus@v1.8.1/entry.go:272 +0x1dc
github.com/sirupsen/logrus.(*Entry).log(0xc0003f2540, 0x2, {0xc000b256c0, 0xe})
    /go/pkg/mod/github.com/sirupsen/logrus@v1.8.1/entry.go:241 +0x34f
github.com/sirupsen/logrus.(*Entry).Log(0xc0003f2540, 0x2, {0xc000a53d98?, 0xc000a53da8?, 0xc000a53da8?})
    /go/pkg/mod/github.com/sirupsen/logrus@v1.8.1/entry.go:293 +0x4f
github.com/sirupsen/logrus.(*Entry).Error(...)
    /go/pkg/mod/github.com/sirupsen/logrus@v1.8.1/entry.go:322
github.com/vmware-tanzu/velero/pkg/plugin/clientmgmt/process.(*logrusAdapter).Error(0xc0004ca420, {0xc000b25630, 0xe}, {0xc000ac0e00?, 0x1?, 0xc00068eea0?})
    /go/src/github.com/vmware-tanzu/velero/pkg/plugin/clientmgmt/process/logrus_adapter.go:90 +0x9c
github.com/hashicorp/go-plugin.(*Client).logStderr(0xc000b656b0, {0x2a5f600?, 0xc000481308?})
    /go/pkg/mod/github.com/hashicorp/go-plugin@v1.4.3/client.go:1039 +0xb8a
created by github.com/hashicorp/go-plugin.(*Client).Start
    /go/pkg/mod/github.com/hashicorp/go-plugin@v1.4.3/client.go:611 +0x144c
lnehrin commented 1 year ago

Yeah, this is killer. Trying to run Velero v1.10.2 with AWS plugin v1.6.1 and I cannot get past this. Makes Velero unusable.

Lyndon-Li commented 1 year ago

@lnehrin We expect this problem won't happen in the normal case because we have checked the plugins and found plugins seldomly print log with WithError. So as the initial discussion, we prepare to fix in the next release, 1.12.

If you face the problem in v1.10.2, this is not the case, we will not only need to fix this in 1.11 but also need to fix it in v1.10.x.

So please share us the Velero log bundle when the problem happened in your env, let's check which log caused the problem and redecide.

lnehrin commented 1 year ago

velero.log

I tried various installations of velero, with and without CSI, and even no plugins at all. I thought I could migrate between EKS clusters, exactly as the example in the docs. Maybe my source cluster EKS v1.21.14 is just too unhappy. When I do an identical installation on a new EKS cluster v1.24.8 (no CSI) - a backup succeeds. Unfortunately my use case is to migrate namespaces from the old cluster v1.21.14 to the new cluster v1.24.8. I think I'm going to be stuck with manually migrating the PVs and fresh helm installations of what I need to get installed.

peixian commented 1 year ago

I'm also seeing this in v1.10.2, log attached. velero_out.log

Lyndon-Li commented 1 year ago

Thank all, we will fix this in v1.11 and v1.10.3

Lyndon-Li commented 1 year ago

Close as completed