gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Duplicacy backup slows down over time #199

Open tbain98 opened 7 years ago

tbain98 commented 7 years ago

When doing an initial backup on a 250GB repository, my upload speed starts high (~5.4 MB/s) but by halfway through it has dropped to less than half that (2.39 MB/s).

Backup programs typically experience slowdowns over time due to the increasing size of the data structure used to perform deduplication, but since Duplicacy leverages the remote filesystem for that function, I wouldn't have expected to see that type of slowdown here.

Configuration details:

leftytennis commented 7 years ago

In my usage, I've noticed the same thing, but I suspect it's how duplicacy is reporting the speed, not the actual backup speed. For example, when it skips chunks that it doesn't have to upload, that artificially raises the speed reported. Another variable is how well data compresses.

I believe duplicacy is reporting total size completed, not actual compressed size. I don't know that for a fact as I haven't looked at the code here, but that's the impression I got.

gilbertchen commented 7 years ago

Right, the reported speed is based on the number of raw bytes before compression and deduplication, so the speed may vary a lot depending on the actual file contents.
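
Roughly speaking (a toy illustration, not the actual code), the two rates diverge like this: skipped and compressed chunks count fully toward the raw-byte progress but transfer fewer (or zero) bytes over the network.

```go
// Toy illustration (not Duplicacy source) of why a speed based on raw bytes
// diverges from the network rate: deduplicated and compressed chunks count
// fully toward "raw" progress but transfer fewer (or zero) bytes.
package main

import "fmt"

type chunk struct {
	rawBytes  int64 // file bytes covered by this chunk
	sentBytes int64 // bytes actually uploaded (0 if deduplicated/skipped)
}

func main() {
	chunks := []chunk{
		{rawBytes: 4 << 20, sentBytes: 0},       // duplicate chunk: skipped
		{rawBytes: 4 << 20, sentBytes: 3 << 20}, // compressible data
		{rawBytes: 4 << 20, sentBytes: 4 << 20}, // incompressible photo/video
	}
	elapsedSeconds := 3.0 // pretend these three chunks took three seconds
	var raw, sent int64
	for _, c := range chunks {
		raw += c.rawBytes
		sent += c.sentBytes
	}
	fmt.Printf("reported (raw) speed: %.1f MB/s\n", float64(raw)/1e6/elapsedSeconds)
	fmt.Printf("actual network speed: %.1f MB/s\n", float64(sent)/1e6/elapsedSeconds)
}
```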

tbain98 commented 7 years ago

Unfortunately, the theories you both put forth don't match the behavior I've observed.

I'm currently doing the initial backup for my repository, but I've cancelled and resumed it several times, primarily to tweak the number of threads but also once because my Internet provider had a service outage that killed the backup where it was.

Each time I restart, there's a relatively small number of chunks whose speed fluctuates, but it very quickly settles into a steady-state speed, and then I stay at that exact speed for chunk after chunk after chunk. That makes sense, since the content in question is primarily non-duplicate digital photos and movies, which are never going to benefit from deduplication or compression and which will always require approximately the same processing effort.

Instead of "jumpy" rates that vary from chunk to chunk, what I see is very consistent rates at every time I check on it, but what those rates are at hour 0 is much faster than what I see at hour 12, which in turn is faster than what I see at hour 24. Put another way, if you graph the speeds, you'd get a very smooth curve that trended downward over time, not a graph that's jagged but averages out the same over its lifetime. So although the explanation you both provided could very well be what some people experience, it isn't what's going on for me.

One thing I wasn't sure about was how Duplicacy reports chunks it's able to skip because they were completed in a previous incomplete backup that's being resumed. I've not seen pages upon pages of lines saying "chunk skipped," so I assume that Duplicacy just silently starts working from the point that the incomplete backup left off, without giving any indication of how much content it's skipping. Is that accurate for how the code actually works?

If so, I may submit a new issue requesting improved logging to indicate what's being done; if not, then it means that I'm re-uploading the entirety of my repository when few to no files have changed, which would imply that something's not working properly in the chunking process.

stevenhorner commented 7 years ago

@tbain98 I have had the same behaviour as you: it starts fast and gradually slows, at least that's what happened for the first few backups when I did the same as you described. The backups never completed (because I cancelled them), so I guess it skipped a lot at the start, which gave a falsely fast speed, and then, once it actually started uploading, it appeared to gradually slow. Since the original backup completed, I don't think I've seen this as much.

I never saw any option for logging, so I just wrote the output of Duplicacy to a log file:

duplicacy backup >> /var/log/duplicacy.log

Beware that the file becomes huge if you have a lot of files, but it's useful for seeing what happened; you can grep for errors, skips, or totals.

gilbertchen commented 7 years ago

There is a feature request to use a smaller window to calculate the upload speed: https://github.com/gilbertchen/duplicacy/issues/128.

Once that has been done, we'll know for sure whether there is a real issue here.
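
The idea would be something like this (just a sketch of the approach, not a final implementation): keep only the most recent samples and compute the rate over that window rather than over the whole backup.

```go
// Sketch of a sliding-window upload-speed estimate, roughly in the spirit of
// issue #128 (not the actual Duplicacy implementation): keep only recent
// samples so the displayed rate tracks current throughput instead of the
// average since the backup started.
package main

import (
	"fmt"
	"time"
)

type sample struct {
	when  time.Time
	bytes int64
}

type windowedRate struct {
	window  time.Duration
	samples []sample
}

// Add records a completed upload and drops samples older than the window.
func (w *windowedRate) Add(n int64) {
	now := time.Now()
	w.samples = append(w.samples, sample{when: now, bytes: n})
	cutoff := now.Add(-w.window)
	for len(w.samples) > 0 && w.samples[0].when.Before(cutoff) {
		w.samples = w.samples[1:]
	}
}

// BytesPerSecond reports the rate over the retained window. The oldest
// sample only marks the start of the window, so its bytes are excluded.
func (w *windowedRate) BytesPerSecond() float64 {
	if len(w.samples) < 2 {
		return 0
	}
	var total int64
	for _, s := range w.samples[1:] {
		total += s.bytes
	}
	span := w.samples[len(w.samples)-1].when.Sub(w.samples[0].when).Seconds()
	if span <= 0 {
		return 0
	}
	return float64(total) / span
}

func main() {
	r := &windowedRate{window: 30 * time.Second}
	r.Add(4 << 20) // call this after each uploaded chunk
	time.Sleep(time.Second)
	r.Add(4 << 20)
	fmt.Printf("recent rate: %.1f MB/s\n", r.BytesPerSecond()/1e6)
}
```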

@tbain98 when a backup is aborted, the lists of files and chunks that have already been uploaded are saved to the file .duplicacy/incomplete. On resume, the chunks listed in this file are checked against the storage, and any file whose chunks all exist in the storage is skipped. These skipped chunks are not counted when calculating the upload speed.
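
In rough, simplified form (illustrative names only, not the actual source), the resume check works like this:

```go
// Simplified sketch of the resume flow described above (illustrative names,
// not the actual Duplicacy source): load the saved chunk list from
// .duplicacy/incomplete, check each chunk against the storage, and skip the
// ones already present; their bytes are excluded from the speed calculation.
package main

import "fmt"

type incompleteState struct {
	UploadedChunks []string // chunk hashes recorded before the backup was aborted
}

type storage interface {
	ChunkExists(hash string) (bool, error) // assumed backend lookup
}

// chunksToSkip returns the previously uploaded chunks that are still present
// in the storage; the caller neither re-uploads them nor counts their bytes
// toward the reported upload speed.
func chunksToSkip(state incompleteState, s storage) (map[string]bool, error) {
	skip := make(map[string]bool)
	for _, hash := range state.UploadedChunks {
		exists, err := s.ChunkExists(hash)
		if err != nil {
			return nil, err
		}
		if exists {
			skip[hash] = true
		}
	}
	return skip, nil
}

// memoryStorage is a stand-in backend for the example.
type memoryStorage map[string]bool

func (m memoryStorage) ChunkExists(hash string) (bool, error) { return m[hash], nil }

func main() {
	state := incompleteState{UploadedChunks: []string{"aa11", "bb22", "cc33"}}
	remote := memoryStorage{"aa11": true, "bb22": true} // "cc33" never made it
	skip, _ := chunksToSkip(state, remote)
	fmt.Println("chunks to skip:", skip)
}
```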

tbain98 commented 7 years ago

I observed that behavior on my first backup to an empty B2 bucket. It was what prompted me to stop my backup in the first place so I could increase the number of upload threads, because performance had dropped to under half of what it started at after 8-12 hours (I don't remember exactly how long I let it run before I checked back in and saw that it was slow). And when I saw it happen on incremental backups, there were almost no skipped chunks, and the upload rate computation had stabilized by the time I observed the initial speed.

I don't believe that this is in any way related to skipped chunks or resuming an incomplete backup.

mister2d commented 6 years ago

@tbain98 Old thread, but I'm wondering if you still see the same symptoms. I too thought something was wrong after seeing a decrease in the rate reported in the stats.

However, I pulled the SNMP metrics from my router, and it appears that the rate is actually consistent rather than dropping off as I first thought.

Check out my screenshot from a 12 hour time period. https://image.ibb.co/kaK6Ec/12hr_duplicacy.png

Maybe this is an average calculated over time much like the average MPG value in my car. If I reset it at the start of a trip it remains in the high range. After about a week it settles back down to the mid teens in MPG (gas guzzler).
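
To illustrate with the rough figures from the first post (the hour-by-hour breakdown is made up, just to show the shape of the curve): if the first hour effectively runs at 5.4 MB/s (for example because of skipped chunks) and every later hour at a steady 2.4 MB/s, a cumulative average keeps declining smoothly for a long time even though the instantaneous rate is constant.

```go
// Synthetic illustration of the MPG analogy: one fast first hour followed by
// a constant steady rate still produces a cumulative average that declines
// slowly over many hours.
package main

import "fmt"

func main() {
	const fast, steady = 5.4, 2.4 // MB/s; rough figures from the original report
	total := fast * 3600          // MB uploaded in the first hour
	for hour := 2; hour <= 24; hour++ {
		total += steady * 3600
		if hour%6 == 0 {
			fmt.Printf("cumulative average after %2d h: %.2f MB/s\n",
				hour, total/float64(hour*3600))
		}
	}
}
```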

tbain98 commented 6 years ago

I only saw this behavior during my initial backup to B2; I've not noticed it on more recent incremental backups (though I've really not been looking for it so it's possible it's still there and I've just not noticed it).

I don't buy the "average over time" argument, though. Your car reverts to its mean, even if you have short measurement periods that differ from that mean. This wasn't a reversion to a mean (after which it remains static), it was a slow decline over a long period of time.

However, based on what your router showed, it's possible that this issue is simply a problem with the computation of the metrics themselves rather than a problem with the underlying functionality.

At some point, it's possible that I'll do a full upload to a new B2 bucket to test whether I can reproduce this behavior, but it's not something I've put the time into thus far.