Journalized activity recorder for backup and restore

Lyndon-Li commented 1 year ago

At present, to tell exactly what a backup or restore has done, users need to collect information from multiple places, e.g., from various CRs and various logs. In other words, critical information is not listed centrally in a journal style for Velero backups and restores. Moreover, the information in the logs is getting increasingly complicated.

One possible solution is to use the Event mechanism:

Create an Event recorder for the Backup and Restore CRs
Record the critical steps and info as Events as the backup/restore runs
In the same cluster, Velero doesn't need to do anything more; users just need to run kubectl describe
To support backup sync, Velero needs to back up the Event objects as part of the backup, just like the backup logs
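
To make the idea above a bit more concrete, here is a minimal sketch of what a journal-style recorder could look like. It is not Velero code and does not talk to Kubernetes; the names `JournalEntry`, `ActivityJournal`, and the reasons such as `BackupStarted` are all hypothetical, chosen only to show critical steps being recorded in order and listed in one place.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class JournalEntry:
    """One critical step of a backup or restore (hypothetical shape)."""
    timestamp: datetime
    reason: str   # short step name, e.g. "BackupStarted" (made up for this sketch)
    message: str  # human-readable detail


@dataclass
class ActivityJournal:
    """Ordered record of the steps of a single backup or restore."""
    owner: str  # e.g. "backup/nightly" (hypothetical identifier)
    entries: List[JournalEntry] = field(default_factory=list)

    def record(self, reason: str, message: str) -> None:
        # Append in call order so the journal reads chronologically.
        self.entries.append(
            JournalEntry(datetime.now(timezone.utc), reason, message)
        )

    def describe(self) -> str:
        # Render all entries centrally, one line per recorded step.
        lines = [f"Journal for {self.owner}:"]
        for e in self.entries:
            stamp = e.timestamp.isoformat(timespec="seconds")
            lines.append(f"  {stamp}  {e.reason}  {e.message}")
        return "\n".join(lines)


journal = ActivityJournal(owner="backup/nightly")
journal.record("BackupStarted", "collecting resources")
journal.record("BackupCompleted", "all items processed")
print(journal.describe())
```

In a real implementation the `record` call would presumably go through a Kubernetes event recorder rather than an in-memory list.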

shawn-hurley commented 1 year ago

I think this would also be a great change for third-party data movers to have a common way to give information to the user during the backup and restore.

I also love the UX of this personally as a user of k8s. I am so used to getting this information with kubectl describe.

Lyndon-Li commented 1 year ago

To cover 3rd-party data movers, one possible way is to provide this journalized event mechanism as a generic mechanism of the Velero backup/restore workflow, so that these events go together with the Velero backup/restore no matter which module generates them.

Lyndon-Li commented 1 year ago

One thing that may hinder the proposal to use the Kubernetes Event mechanism is:

Kubernetes' event resources have associated TTLs which cannot be disabled
The default TTL value is 1 hour
The TTL value is nearly impossible to reconfigure (it is an api-server parameter that applies to the entire etcd storage)

As a result, if we store the backup & restore events as Kubernetes Events, they will be cleared after 1 hour. For long-running backups, this is not enough.

This means:

Even in the same cluster, Velero needs to back up these events for each backup in a timely manner
Even in the same cluster, the complete set of events for each backup needs to be retrieved from the backup tarball, so kubectl get backup -n velero should not be used

Then we will need to consider whether this is sufficiently simpler than creating a dedicated event mechanism in Velero.

shawn-hurley commented 1 year ago

I think there are two concerns here:

1. A user having a k8s-native way to determine how the backup is currently acting
2. A complete recording of every action that a user could look over after the fact

For the first, events should be used because they will alert the user to what is happening.

For 2, the events are stored in the audit log, IIRC. That would be a place to point users for 2, or we can create a new log file that just TEEs the events and saves them in the backup repository.

IDK, does that make sense?
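
A rough sketch of the TEE idea, to show how small the extra step could be. Everything here is hypothetical: `TeeRecorder` is not a Velero or Kubernetes type, `emit_event` stands in for whatever call would publish a Kubernetes Event, and the in-memory `journal_file` stands in for a log file that would later be saved with the backup.

```python
import io
import json
from datetime import datetime, timezone


class TeeRecorder:
    """Send each record to a short-lived event sink and to a durable journal.

    Hypothetical sketch: `emit_event` is a placeholder for publishing a
    Kubernetes Event, and `journal` is any writable text stream.
    """

    def __init__(self, emit_event, journal):
        self._emit_event = emit_event
        self._journal = journal

    def record(self, reason, message):
        # Live view: subject to the cluster's event TTL.
        self._emit_event(reason, message)
        # Durable view: one JSON object per line, kept with the backup.
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
            "message": message,
        }
        self._journal.write(json.dumps(entry) + "\n")


live_events = []               # stand-in for the cluster's Event objects
journal_file = io.StringIO()   # stand-in for a file uploaded with the backup
recorder = TeeRecorder(lambda r, m: live_events.append((r, m)), journal_file)
recorder.record("BackupStarted", "collecting resources")
recorder.record("BackupFailed", "snapshot timed out")

journal_lines = [json.loads(line) for line in journal_file.getvalue().splitlines()]
print(len(live_events), len(journal_lines))
```

Both sinks receive every record, so the live events cover the first concern while the journal covers the second even after the events have expired.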

Lyndon-Li commented 1 year ago

Personally, if the events only last 1 hour, I think even for 1 it will lose a lot of value: users will not check the events promptly during the backup, especially for scheduled backups. Think about how scheduled backups are used; users usually schedule them in a window of time when the environment is not heavily used.

shawn-hurley commented 1 year ago

Hm sounds like a different use case to me, TBH.

I think that when I create a backup, you can tell me <we have done X, we are doing Y> and keep this info coming (you can see the "got event eight times over the last 5 min"). This helps you know that things are being worked on.

It sounds like you are focused on the case of me coming in on Monday morning, and my backup, which was supposed to run on Sunday at 8 pm or something, has failed. Here I agree that having a journaled log in the backup (like the TEE approach I talked about) would be useful.

Sounds like you just disagree that the first use case is relevant or needed?

Lyndon-Li commented 1 year ago

I think it is less valuable if it can only support the first case you mentioned, because:

It is not common practice for a backup user to create a backup and then watch it as it runs, especially for scheduled backups, which are usually background tasks run outside working hours.
People will be annoyed if they can see something (events within the latest hour) but cannot see everything

Let me discuss this within the team and address:

What everyone thinks of the value, based on the current situation
Whether we want to do something for it in 1.13

shawn-hurley commented 1 year ago

I disagree with

"People will be annoyed if they can see something (events within the latest hour) but cannot see everything"

This is how kube events work. This is known and works for long-running pods, PVs, PVCs, Jobs, etc.

Please consider making it easier for users to use normal k8s tooling to debug rather than using something special. I agree on something special for the second case as there is no other option. And as stated, just adding a call to EventRecorder when you add a journal log is minimal complexity.

I also can't entirely agree that the only way someone uses this is from scheduled backups. We have many use cases where users watch the backups, and this would be very helpful.

shawn-hurley commented 1 year ago

I also think that we should have a conversation on this in the open. Can we add it to the next community meeting instead?

Lyndon-Li commented 1 year ago

Sure, let's try to reach more people and hear more voices.

To summarize my personal opinion: if the solution can meet both 1 and 2, I will fully vote for it. If it only meets 1, I will not be confident in its value. I also believe that, even for Kubernetes itself, this event implementation is not perfect; it is actually a compromise for etcd's low performance in handling the load in this scenario.

My understanding may be wrong. So let's see more comments later from others.

weshayutin commented 7 months ago

++ love the idea

vmware-tanzu / velero

Journalized activity recorder for backup and restore #6606