mrnold opened 3 months ago
- `elapsed time` - This is equivalent to the value in the `STARTED` field.
- `transfer rate` - This is more straightforward, but I'm not sure if we can easily map it to a field in the CRD.

Given the information has been provided in the CR of DataUpload, I want to tentatively move it to the icebox.
@reasonerjt `elapsed time` is only equivalent to `STARTED` before the DU completes. The main point here is to make it easy for a user looking at `kubectl get du` output to see how long each DU took to process.

I opened #8128 to hold some work that will eventually show the difference in a clearer way.
@mrnold I don't think we need to add any new fields in DUCR/DDCR for this issue:

- elapsed time = [current time] - startTimeStamp = `STARTED`
- transfer rate = doneBytes / elapsed time

What we need here -- `startTimeStamp` and `doneBytes` -- are both already included in the DUCR/DDCR. Therefore, there is only a minor gap for `transfer rate` (`elapsed time` is already there as the `STARTED` column). But the real gap is how to use the kubebuilder marker `+kubebuilder:printcolumn` to show the calculated result.

So please check this direction first; we don't want to add redundant info into DUCR/DDCR.
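For context, a sketch of what those printcolumn markers look like, loosely modeled on the DataUpload type (the exact markers and JSON paths in Velero's API may differ):

```go
package v2alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DataUploadSpec struct{}   // fields elided for the sketch
type DataUploadStatus struct{} // fields elided for the sketch

// Each printcolumn's JSONPath can only select a field that already exists on
// the stored object; there is no marker syntax for a computed expression such
// as doneBytes / elapsed time, which is the gap described above.
// +kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Started",type="date",JSONPath=".status.startTimestamp"
// +kubebuilder:printcolumn:name="Bytes Done",type="integer",JSONPath=".status.progress.bytesDone"
// +kubebuilder:printcolumn:name="Total Bytes",type="integer",JSONPath=".status.progress.totalBytes"
type DataUpload struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DataUploadSpec   `json:"spec,omitempty"`
	Status DataUploadStatus `json:"status,omitempty"`
}
```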
@sseago

> elapsed time is only equivalent to STARTED before the DU completes.

I think the output of `kubectl get du` is only meaningful for showing realtime info while a DU/DD is running. Once a DU/DD finishes, `velero backup describe` is a better choice for checking the static info. E.g., a DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.
Just to summarize a few comments from the community meeting: one of the goals is to allow users to see that incremental backups are working. This is more important than showing a transfer rate. I think doneBytes might not be enough to show that successive backups transfer less data than an initial backup. I am looking into what options Kopia has for showing bytes moved.
I will see if I can get printcolumn to show calculated results, but I haven't yet figured out a way to do this without changing the CRD.
@mrnold

> allow users to see that incremental backups are working

The `cachedBytes` counter, as in the code here, could fulfill this purpose. It means the bytes that have been skipped due to the incremental behavior, and it monotonically increases as the backup runs.

The current code retrieves this info but does not set it on the DUCR, so we need to add one more field to the DUCR's `DataUploadStatus` and also add a new column through `+kubebuilder:printcolumn`.
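A minimal sketch of what that could look like, assuming a field name of `CachedBytes` (illustrative, not a committed API):

```go
package v2alpha1

// Sketch of the proposed addition; CachedBytes is an assumed field name, not
// a committed API, and the existing status fields are elided.
type DataUploadStatus struct {
	// CachedBytes is the number of bytes skipped because they were already in
	// the repository -- i.e., the savings from incremental backup. It only
	// increases while the backup runs.
	// +optional
	CachedBytes int64 `json:"cachedBytes,omitempty"`
}

// The matching column would then be declared on the DataUpload type:
// +kubebuilder:printcolumn:name="Cached Bytes",type="integer",JSONPath=".status.cachedBytes"
```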
@sseago

> > elapsed time is only equivalent to STARTED before the DU completes.
>
> I think the output of `kubectl get du` is only meaningful for showing realtime info while a DU/DD is running. Once a DU/DD finishes, `velero backup describe` is a better choice for checking the static info. E.g., a DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.
Hmm. OK -- so maybe the answer is to enhance `backup describe` to include the elapsed time for each DU? What we want here is some way for a user to see, in one place, how long each of the DU operations took. This could be in the output of `kubectl get du` or in the output of `velero backup describe`. Same for the other new field we want to add around transferred and/or cached bytes.

@Lyndon-Li Does moving this functionality to `backup describe` resolve your objections? @mrnold Does `backup describe` work for your use case as well as `kubectl get du`?
Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.
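In code, the branch sseago describes is a one-line distinction; a minimal sketch (function and variable names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// elapsed reports how long a DU has been (or was) in progress: completed DUs
// use their completion timestamp, in-progress ones use the current time.
func elapsed(start time.Time, completion *time.Time, now time.Time) time.Duration {
	if completion != nil {
		return completion.Sub(start)
	}
	return now.Sub(start)
}

func main() {
	now := time.Now()

	// DU1 started 19m ago and completed after 10m; DU2 started 9m ago and is
	// still running. STARTED would show "19m" and "9m"; elapsed gives 10m, 9m.
	du1Start := now.Add(-19 * time.Minute)
	du1Done := du1Start.Add(10 * time.Minute)
	du2Start := now.Add(-9 * time.Minute)

	fmt.Println(elapsed(du1Start, &du1Done, now)) // 10m0s
	fmt.Println(elapsed(du2Start, nil, now))      // 9m0s
}
```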
@sseago Let me summarize the info I think we need for backup/restore performance observability:

- `elapsed time`
  - In `kubectl get du`, we already have it, that is, the `STARTED` column.
  - In `velero backup describe`, we need to add it, because a DU is not guaranteed to exist after the backup completes, and it is not so meaningful to check static info through `kubectl get du`.
- `cached bytes` vs. `total bytes`: for an incremental backup, cached bytes should normally be less than `totalBytes`.
- `throughput`

For all of this info:

- Show it in `velero backup describe`. To make that work, I think we can add the info to VolumeBackupInfo/VolumeRestoreInfo, so that we don't need to download every DU/DD during the describe.
- Show it in `kubectl get du`. Because `kubectl get du` is for realtime observability only, this improves the user experience, since users see the info change dynamically and earlier.

> Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use `startTimestamp` to calculate `elapsedTime`. The `STARTED` column is already `elapsedTime` (before the backup/restore completes).
> > Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.
>
> We don't need to use `startTimestamp` to calculate `elapsedTime`. The `STARTED` column is already `elapsedTime` (before the backup/restore completes).
It isn't though. If you have 2 DUs that run sequentially in a backup (on same node, with parallel execution disabled), and each takes 10 minutes, at minute 19, you'll see DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m" -- but elapsed time should tell us how long the DU has been in progress until completion -- so those fields would need to show "10m" and "9m".
> DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. But just keep my point in mind that `kubectl get du` is only meaningful while that DU/DD is running. Once it completes, we should check with `velero backup describe`. If you agree, we will not need to add one more field to DUCR/DDCR.
> > DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m"
>
> Yes, it is. But just keep my point in mind that `kubectl get du` is only meaningful while that DU/DD is running. Once it completes, we should check with `velero backup describe`. If you agree, we will not need to add one more field to DUCR/DDCR.
OK, let's step back a bit. If I'm a user monitoring a running backup and my main concern is DU progress, I want to look in one place to see: 1) which DUs are currently in progress; 2) for this backup, how long each DU has taken (either to completion, or to now for still-running ones), how much is done (for still-running ones), and total bytes (and eventually some data indicating skipped/transferred bytes for incremental volume backup).

Currently, the only place to see any of this is `kubectl get du`. There we see in-progress status, bytesDone, and totalBytes, but we see the correct elapsed time only for in-progress DUs, not completed ones.

One possibility is to enhance the DU display columns. Of course, getting elapsed time would require a new field.

Another would be to enhance `velero backup describe` -- perhaps a DU summary which gives similar information to `kubectl get du`, with a properly calculated field for elapsed time. If done here, we wouldn't need the CRD change. But the important point is that this would need to work for in-progress backups as well as completed ones, since this is needed for real-time observability.

Either way, we need a real-time view (one that changes as the backup and DUs progress) showing completed, in-progress, and queued DUs for the current backup, with elapsed time, bytes progress, etc. as appropriate.
@sseago After further thought, the following may be better for `elapsedTime`:

- Add an `elapsedTime` field into DUCR/DDCR.
- Make the `STARTED` column show the time since the DU/DD was Accepted.

Use `elapsedTime` over `STARTED` because it is updated simultaneously with `doneBytes` and `cachedBytes`. Then, when we calculate `throughput`, the result is always correct; whereas `STARTED`, i.e. `currentTime - startTimestamp`, is updated by Kubernetes, so if we use it to calculate `throughput` the number may jump around weirdly when `doneBytes`/`cachedBytes` is not updated in time.

Change `STARTED` because it duplicates `elapsedTime`. But rather than simply removing it, we can change it to express other info --- the time taken to prepare the snapshot/restored volume.
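A minimal sketch of the consistency argument (all names are illustrative, not Velero's actual code): if `elapsedTime` is written in the same status update as the byte counters, a throughput derived from any single read of the status is coherent:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative status shape; these field names are assumptions for the
// sketch, not Velero's actual DUCR API.
type duStatus struct {
	startTimestamp time.Time
	doneBytes      int64
	cachedBytes    int64
	elapsedTime    time.Duration
}

// updateProgress writes elapsedTime in the same update as the byte counters,
// so a throughput computed from any single read of the status is coherent.
func updateProgress(s *duStatus, doneBytes, cachedBytes int64, now time.Time) {
	s.doneBytes = doneBytes
	s.cachedBytes = cachedBytes
	s.elapsedTime = now.Sub(s.startTimestamp)
}

func main() {
	s := &duStatus{startTimestamp: time.Now().Add(-30 * time.Second)}
	updateProgress(s, 600_000_000, 100_000_000, time.Now())

	// Derive throughput from values that were captured together.
	fmt.Printf("%.0f bytes/sec\n", float64(s.doneBytes)/s.elapsedTime.Seconds())
}
```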
Per discussion, let's keep it in the icebox, given we don't have agreement on how to demonstrate incremental backup. And we need a design.
I tried using cachedBytes, but did not have much success. It seems like kopia calculates cachedBytes up front, so using (doneBytes - cachedBytes) / elapsedTime for throughput shows a negative number for most of an incremental backup transfer.

I also see there is an uploadedBytes that is supposed to update when bytes are uploaded to storage, but in my tests I did not see it change from 0. So I agree I don't really have a good basis for a design yet.
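To make the failure mode concrete, a toy calculation with invented numbers (not measured values):

```go
package main

import "fmt"

func main() {
	// Invented numbers for illustration: kopia appears to count the cached
	// total up front, while doneBytes climbs gradually during the transfer.
	doneBytes := int64(200_000_000)   // processed so far, 20s into the run
	cachedBytes := int64(900_000_000) // reported immediately at start
	elapsedSec := 20.0

	// (doneBytes - cachedBytes) / elapsed stays negative until doneBytes
	// overtakes cachedBytes -- i.e., for most of an incremental transfer.
	fmt.Printf("%.0f bytes/sec\n", float64(doneBytes-cachedBytes)/elapsedSec)
	// Output: -35000000 bytes/sec
}
```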
**Describe the problem/challenge you have**

We had users evaluating backup performance by watching the progress of DataUploads, basically like this:
```
$ oc get dataupload --all-namespaces -w
NAMESPACE       NAME   STATUS       STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
openshift-adp   test                                                     ts-velero-test-1   0s
openshift-adp   test                                                     ts-velero-test-1   0s
openshift-adp   test   Accepted                                          ts-velero-test-1   0s
openshift-adp   test   Prepared                                          ts-velero-test-1   62s   ip.lan
openshift-adp   test   InProgress   0s                                   ts-velero-test-1   62s   ip.lan
openshift-adp   test   InProgress   13s       555745280                  ts-velero-test-1   75s   ip.lan
openshift-adp   test   InProgress   23s       1073741824                 ts-velero-test-1   85s   ip.lan
openshift-adp   test   InProgress   24s       1073741824   1073741824    ts-velero-test-1   86s   ip.lan
openshift-adp   test   Completed    24s       1073741824   1073741824    ts-velero-test-1   86s   ip.lan
```
It would have been helpful to see a little more information about the transfer, to help identify where to allocate resources for their cluster.

**Describe the solution you'd like**

I would like to add another field or two to this output so we can see elapsed time and transfer rate at a glance. This would help users estimate what backups can fit in their time window. It would also be good to see a count of the actual bytes moved over the network, taking kopia's incremental snapshots into account.

**Anything else you would like to add:**

I have not yet researched what kind of progress reporting kopia provides; please chime in if this sounds infeasible.
**Environment:**

- Velero version (use `velero version`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

Vote on this issue!