mrnold opened 3 months ago
- `elapsed time` - This is equivalent to the value in the `STARTED` field.
- `transfer rate` - This is more straightforward, but I'm not sure if we can easily map it to a field in the CRD.

Given the information has been provided in the CR of DataUpload, I want to tentatively move it to the icebox.
@reasonerjt `elapsed time` is only equivalent to `STARTED` before the DU completes. The main point here is to make it easy for a user looking at `kubectl get du` output to see how long each DU took to process.

I opened #8128 to hold some work that will eventually show the difference in a clearer way.
@mrnold I don't think we need to add any new fields in DUCR/DDCR for this issue:

- elapsed time = [current time] - startTimeStamp = `STARTED`
- transfer rate = doneBytes / elapsed time

What we need here -- `startTimeStamp` and `doneBytes` -- are both already included in the DUCR/DDCR. Therefore, there is only a minor gap for `transfer rate` (`elapsed time` is already there as the `STARTED` column). But the real gap is how to use the kubebuilder marker `+kubebuilder:printcolumn` to show the calculated result.

So please check this direction first; we don't want to add redundant info into DUCR/DDCR.
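For context, a sketch of what those printcolumn markers look like, loosely modeled on the DataUpload type (the exact markers and JSON paths in Velero's API may differ):

```go
package v2alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DataUploadSpec struct{}   // fields elided for the sketch
type DataUploadStatus struct{} // fields elided for the sketch

// Each printcolumn's JSONPath can only select a field that already exists on
// the stored object; there is no marker syntax for a computed expression such
// as doneBytes / elapsed time, which is the gap described above.
// +kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Started",type="date",JSONPath=".status.startTimestamp"
// +kubebuilder:printcolumn:name="Bytes Done",type="integer",JSONPath=".status.progress.bytesDone"
// +kubebuilder:printcolumn:name="Total Bytes",type="integer",JSONPath=".status.progress.totalBytes"
type DataUpload struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DataUploadSpec   `json:"spec,omitempty"`
	Status DataUploadStatus `json:"status,omitempty"`
}
```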
@sseago

> elapsed time is only equivalent to STARTED before the DU completes.

I think the output of `kubectl get du` is only meaningful for showing realtime info while a DU/DD is running. Once a DU/DD finishes, `velero backup describe` is a better choice for checking the static info. E.g., a DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.
Just to summarize a few comments from the community meeting: one of the goals is to allow users to see that incremental backups are working. This is more important than showing a transfer rate. I think doneBytes might not be enough to show that successive backups transfer less data than an initial backup. I am looking into what options Kopia has for showing bytes moved.
I will see if I can get printcolumn to show calculated results, but I haven't yet figured out a way to do this without changing the CRD.
@mrnold

> allow users to see that incremental backups are working

The `cachedBytes` counter, as in the code here, could fulfill this purpose. It means the bytes that have been skipped due to the incremental behavior, and it monotonically increases as the backup runs.

The current code retrieves this info but does not set it on the DUCR, so we need to add one more field to the DUCR's `DataUploadStatus` and also add a new column through `+kubebuilder:printcolumn`.
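A minimal sketch of what that could look like, assuming a field name of `CachedBytes` (illustrative, not a committed API):

```go
package v2alpha1

// Sketch of the proposed addition; CachedBytes is an assumed field name, not
// a committed API, and the existing status fields are elided.
type DataUploadStatus struct {
	// CachedBytes is the number of bytes skipped because they were already in
	// the repository -- i.e., the savings from incremental backup. It only
	// increases while the backup runs.
	// +optional
	CachedBytes int64 `json:"cachedBytes,omitempty"`
}

// The matching column would then be declared on the DataUpload type:
// +kubebuilder:printcolumn:name="Cached Bytes",type="integer",JSONPath=".status.cachedBytes"
```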
@sseago

> > elapsed time is only equivalent to STARTED before the DU completes.
>
> I think the output of `kubectl get du` is only meaningful for showing realtime info while a DU/DD is running. Once a DU/DD finishes, `velero backup describe` is a better choice for checking the static info. E.g., a DU/DD is not a necessary resource after it completes, so there is no guarantee that users could see it.
Hmm. OK -- so maybe the answer is to enhance `backup describe` to include the elapsed time for each DU? What we want here is some way for a user to see, in one place, how long each of the DU operations took. This could be in the output of `kubectl get du` or in the output of `velero backup describe`. Same for the other new field we want to add around transferred and/or cached bytes.

@Lyndon-Li Does moving this functionality to `backup describe` resolve your objections? @mrnold Does `backup describe` work for your use case as well as `kubectl get du`?
Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.
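In code, the branch sseago describes is a one-line distinction; a minimal sketch (function and variable names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// elapsed reports how long a DU has been (or was) in progress: completed DUs
// use their completion timestamp, in-progress ones use the current time.
func elapsed(start time.Time, completion *time.Time, now time.Time) time.Duration {
	if completion != nil {
		return completion.Sub(start)
	}
	return now.Sub(start)
}

func main() {
	now := time.Now()

	// DU1 started 19m ago and completed after 10m; DU2 started 9m ago and is
	// still running. STARTED would show "19m" and "9m"; elapsed gives 10m, 9m.
	du1Start := now.Add(-19 * time.Minute)
	du1Done := du1Start.Add(10 * time.Minute)
	du2Start := now.Add(-9 * time.Minute)

	fmt.Println(elapsed(du1Start, &du1Done, now)) // 10m0s
	fmt.Println(elapsed(du2Start, nil, now))      // 9m0s
}
```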
@sseago Let me summarize the info I think we need for backup/restore performance observability:

- `elapsed time`
  - In `kubectl get du`, we already have it, that is, the `STARTED` column.
  - In `velero backup describe`, we need to add it, because a DU is not guaranteed to exist after the backup completes, and it is not so meaningful to check static info through `kubectl get du`.
- `cached bytes` vs. `total bytes`: for an incremental backup, cached bytes should normally be less than `totalBytes`.
- `throughput`

For all of this info:

- Show it in `velero backup describe`. To make that work, I think we can add the info to VolumeBackupInfo/VolumeRestoreInfo, so that we don't need to download every DU/DD during the describe.
- Show it in `kubectl get du`. Because `kubectl get du` is for realtime observability only, this improves the user experience, since users see the info change dynamically and earlier.

> Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.

We don't need to use `startTimestamp` to calculate `elapsedTime`. The `STARTED` column is already `elapsedTime` (before the backup/restore completes).
> > Also, even during the current backup, elapsed time is not "[current time] - startTimeStamp" -- it's "if completed: completionTimestamp - startTimestamp, else: current time - startTimestamp" -- volumes that finish earlier will otherwise appear to take much longer to complete than they actually do.
>
> We don't need to use `startTimestamp` to calculate `elapsedTime`. The `STARTED` column is already `elapsedTime` (before the backup/restore completes).
It isn't though. If you have 2 DUs that run sequentially in a backup (on same node, with parallel execution disabled), and each takes 10 minutes, at minute 19, you'll see DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m" -- but elapsed time should tell us how long the DU has been in progress until completion -- so those fields would need to show "10m" and "9m".
> DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m"

Yes, it is. But just keep my point in mind that `kubectl get du` is only meaningful while that DU/DD is running. Once it completes, we should check with `velero backup describe`. If you agree, we will not need to add one more field to DUCR/DDCR.
> > DU1 showing STARTED at "19m" and DU2 showing STARTED at "9m"
>
> Yes, it is. But just keep my point in mind that `kubectl get du` is only meaningful while that DU/DD is running. Once it completes, we should check with `velero backup describe`. If you agree, we will not need to add one more field to DUCR/DDCR.
OK, let's step back a bit. If I'm a user monitoring a running backup and my main concern is DU progress, I want to look in one place to see: 1) which DUs are currently in progress; 2) for this backup, how long each DU has taken (either to completion, or to now for still-running ones), how much is done (for still-running ones), and total bytes (and eventually some data indicating skipped/transferred bytes for incremental volume backup).

Currently, the only place to see any of this is `kubectl get du`. There we see in-progress status, bytesDone, and totalBytes, but we see the correct elapsed time only for in-progress DUs, not completed ones.

One possibility is to enhance the DU display columns. Of course, getting elapsed time would require a new field.

Another would be to enhance `velero backup describe` -- perhaps a DU summary which gives similar information to `kubectl get du`, with a properly calculated field for elapsed time. If done here, we wouldn't need the CRD change. But the important point is that this would need to work for in-progress backups as well as completed ones, since this is needed for real-time observability.

Either way, we need a real-time view (one that changes as the backup and DUs progress) showing completed, in-progress, and queued DUs for the current backup, with elapsed time, bytes progress, etc. as appropriate.
@sseago After further thought, the following may be better for `elapsedTime`:

- Add an `elapsedTime` field into DUCR/DDCR.
- Make the `STARTED` column show the time since the DU/DD was Accepted.

Use `elapsedTime` over `STARTED` because it is updated simultaneously with `doneBytes` and `cachedBytes`. Then, when we calculate `throughput`, the result is always correct; whereas `STARTED`, i.e. `currentTime - startTimestamp`, is updated by Kubernetes, so if we use it to calculate `throughput` the number may jump around weirdly when `doneBytes`/`cachedBytes` is not updated in time.

Change `STARTED` because it duplicates `elapsedTime`. But rather than simply removing it, we can change it to express other info --- the time taken to prepare the snapshot/restored volume.
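A minimal sketch of the consistency argument (all names are illustrative, not Velero's actual code): if `elapsedTime` is written in the same status update as the byte counters, a throughput derived from any single read of the status is coherent:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative status shape; these field names are assumptions for the
// sketch, not Velero's actual DUCR API.
type duStatus struct {
	startTimestamp time.Time
	doneBytes      int64
	cachedBytes    int64
	elapsedTime    time.Duration
}

// updateProgress writes elapsedTime in the same update as the byte counters,
// so a throughput computed from any single read of the status is coherent.
func updateProgress(s *duStatus, doneBytes, cachedBytes int64, now time.Time) {
	s.doneBytes = doneBytes
	s.cachedBytes = cachedBytes
	s.elapsedTime = now.Sub(s.startTimestamp)
}

func main() {
	s := &duStatus{startTimestamp: time.Now().Add(-30 * time.Second)}
	updateProgress(s, 600_000_000, 100_000_000, time.Now())

	// Derive throughput from values that were captured together.
	fmt.Printf("%.0f bytes/sec\n", float64(s.doneBytes)/s.elapsedTime.Seconds())
}
```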
Per discussion, let's keep it in the icebox, given we don't have agreement on how to demonstrate incremental backup. And we need a design.
I tried using cachedBytes, but did not have much success. It seems like kopia calculates cachedBytes up front, so using (doneBytes - cachedBytes) / elapsedTime for throughput shows a negative number for most of an incremental backup transfer.

I also see there is an uploadedBytes that is supposed to update when bytes are uploaded to storage, but in my tests I did not see it change from 0. So I agree I don't really have a good basis for a design yet.
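To make the failure mode concrete, a toy calculation with invented numbers (not measured values):

```go
package main

import "fmt"

func main() {
	// Invented numbers for illustration: kopia appears to count the cached
	// total up front, while doneBytes climbs gradually during the transfer.
	doneBytes := int64(200_000_000)   // processed so far, 20s into the run
	cachedBytes := int64(900_000_000) // reported immediately at start
	elapsedSec := 20.0

	// (doneBytes - cachedBytes) / elapsed stays negative until doneBytes
	// overtakes cachedBytes -- i.e., for most of an incremental transfer.
	fmt.Printf("%.0f bytes/sec\n", float64(doneBytes-cachedBytes)/elapsedSec)
	// Output: -35000000 bytes/sec
}
```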
**Describe the problem/challenge you have**

We had users evaluating backup performance by watching the progress of DataUploads, basically like this:
```
$ oc get dataupload --all-namespaces -w
NAMESPACE       NAME   STATUS       STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
openshift-adp   test                                                     ts-velero-test-1   0s
openshift-adp   test                                                     ts-velero-test-1   0s
openshift-adp   test   Accepted                                          ts-velero-test-1   0s
openshift-adp   test   Prepared                                          ts-velero-test-1   62s   ip.lan
openshift-adp   test   InProgress   0s                                   ts-velero-test-1   62s   ip.lan
openshift-adp   test   InProgress   13s       555745280                  ts-velero-test-1   75s   ip.lan
openshift-adp   test   InProgress   23s       1073741824                 ts-velero-test-1   85s   ip.lan
openshift-adp   test   InProgress   24s       1073741824   1073741824    ts-velero-test-1   86s   ip.lan
openshift-adp   test   Completed    24s       1073741824   1073741824    ts-velero-test-1   86s   ip.lan
```
It would have been helpful to see a little more information about the transfer, to help identify where to allocate resources for their cluster.

**Describe the solution you'd like**

I would like to add another field or two to this output so we can see elapsed time and transfer rate at a glance. This would help users estimate what backups can fit in their time window. It would also be good to see a count of the actual bytes moved over the network, taking kopia's incremental snapshots into account.

**Anything else you would like to add:**

I have not yet researched what kind of progress reporting kopia provides; please chime in if this sounds infeasible.
**Environment:**

- Velero version (use `velero version`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

Vote on this issue!