jborg / attic

Deduplicating backup program

memory usage is too high #302

Open ThomasWaldmann opened 9 years ago

ThomasWaldmann commented 9 years ago

To accelerate operations, attic keeps some information in RAM:

In this section (and also the paragraph above it), there are some [not completely clear] numbers about memory usage: https://github.com/attic/merge/blob/merge/docs/internals.rst#indexes-memory-usage

So, if I understand correctly, this would be an estimate of the RAM usage (for a local repo):

chunk_count        ~= total_file_size / 65536
repo_index_usage    = chunk_count * 40
chunks_cache_usage  = chunk_count * 44
files_cache_usage   = total_file_count * 240 + chunk_count * 80
mem_usage          ~= repo_index_usage + chunks_cache_usage + files_cache_usage
                    = total_file_count * 240 + total_file_size / 400

All units are bytes. This assumes every chunk is referenced exactly once and that the typical chunk size is 64 KiB.
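As a quick sanity check, the estimate above can be written as a small Python helper (a sketch; the function name is mine, the per-entry constants are the ones from the formulas):

```python
def attic_mem_estimate(total_file_count, total_file_size, chunk_size=65536):
    """Rough RAM estimate in bytes for a local repo, assuming every
    chunk is referenced exactly once and ~64 KiB average chunk size."""
    chunk_count = total_file_size // chunk_size
    repo_index_usage = chunk_count * 40       # repo index: 40 bytes/chunk
    chunks_cache_usage = chunk_count * 44     # chunks cache: 44 bytes/chunk
    files_cache_usage = total_file_count * 240 + chunk_count * 80
    return repo_index_usage + chunks_cache_usage + files_cache_usage

# 1Mi files totalling 1TiB -> roughly 2.8 GiB
print(attic_mem_estimate(2**20, 2**40) / 2**30)
```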

E.g. backing up a total count of 1Mi files with a total size of 1TiB:

mem_usage ~= 1 * 2**20 * 240 + 1 * 2**40 / 400 ~= 2.8 GiB

So, this will need 3 GiB of RAM just for attic. If you run attic on a NAS (or another device with limited RAM), this might already be beyond the RAM you have available and will lead to paging (assuming you have enough swap space) and slowdown. If you don't have enough RAM+swap, attic will fail with "malloc failed" or get killed by the OOM killer.

For bigger servers, the problem will just appear a bit later.

anarcat commented 9 years ago

so could these caches be turned into fixed-size (say relative to available RAM for example) LRU caches? in other words, are they really caches (that we can discard) or indexes (that we can't discard)?

ThomasWaldmann commented 9 years ago

So, the question now is "what are the options to deal with bigger data amounts?".

Some ideas:

ThomasWaldmann commented 9 years ago

@anarcat they are caches in the sense that they cache information from the (possibly remote) repository. So you could kill them and they could be rebuilt from repo information (or from fs when creating the next archive).

LRU won't help: for the files cache, every entry is accessed only once per "attic create". For the chunks cache there are sometimes multiple accesses, but not in an access pattern where LRU would help.
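A toy illustration (not attic code) of why LRU can't win here: over a one-pass workload where every key is looked up exactly once, each access is a cold miss no matter how large the cache is.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache that counts hits and misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key, load):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)   # mark as most recently used
            return self.data[key]
        self.misses += 1
        value = load(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value

# Each file entry is looked up exactly once per run, so every
# access is a cold miss regardless of cache size.
cache = LRUCache(capacity=1000)
for path in (f"/data/file{i}" for i in range(10_000)):
    cache.get(path, load=len)
print(cache.hits, cache.misses)  # -> 0 10000
```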

anarcat commented 9 years ago

ah right, so even if the caches were reused, it wouldn't gain much, because it's only for "within a filesystem" deduplication...

okay, so another strategy, which you seem to already have a few ideas for.. i guess the next step is benchmarks, as there is some fairly low-hanging fruit there (chunk size, for one..)

level323 commented 9 years ago

My 2 cents: the chunk size, and whether or not the cache should be maintained in RAM, will depend on the particular circumstances to which attic is being applied, as there are many use cases, variables and trade-offs to consider.

Therefore, my present assessment is that it makes sense to:

  1. offer an option to specify chunk size at attic repo creation time, and
  2. gracefully and automatically fail over to on-disk storage of the cache when a (preferably user-specifiable) RAM usage threshold is exceeded.

Regarding point 2, modern Linux kernels support per-cgroup resource limiting. So one way to get a seamless fallback from RAM to disk would be to put attic in a cgroup with whatever resource limits and swappiness suit the particular use case. However, this may be considered a bit of a hack and, of course, will not help Mac or Windows users.
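Short of cgroups, a process can also cap itself with setrlimit, so that exceeding the threshold raises a catchable MemoryError instead of invoking the OOM killer. A minimal sketch (Linux-centric; capping address space is only a crude proxy for a RAM threshold, and the 1 GiB figure is a hypothetical limit):

```python
import resource

def cap_address_space(limit_bytes):
    """Cap this process's virtual address space; allocations beyond
    the cap fail with MemoryError instead of triggering the OOM killer."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

cap_address_space(1 * 2**30)  # hypothetical 1 GiB threshold
try:
    buf = bytearray(2 * 2**30)  # deliberately exceeds the cap
except MemoryError:
    # this is where a graceful fallback to an on-disk cache could go
    print("allocation refused")
```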

mathbr commented 9 years ago

@ThomasWaldmann as requested in #300, here is a bit more data from my setup: my media weighs 2.8 TB and currently has 6109 files. Attic's memory usage was usually ~11%, but towards the end it was mostly ~50%. Right before attic died, the usage went up to ~70%. Let me know if you need more details.

ThomasWaldmann commented 9 years ago

@mathbr ~70% of 8 GiB is ~5.6 GiB. The formula computes 6.5 GiB (5.3 for a remote repo) of RAM usage for your backup data. As the formula does not cover all of attic's memory needs, just the repo index and the files/chunks cache, that seems to fit. If you had other stuff running besides attic and your swap space wasn't very large, that may have been all the memory you had.
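For reference, plugging those numbers into the formulas from the issue description reproduces the local figure; assuming "remote repo" means the repo index is not held locally, the remote figure lands near 5.3 when read as decimal GB (~4.9 GiB):

```python
chunk_size = 65536
file_count, total_size = 6109, int(2.8e12)  # 2.8 TB of media, 6109 files
chunk_count = total_size // chunk_size

local = file_count * 240 + chunk_count * (40 + 44 + 80)
remote = file_count * 240 + chunk_count * (44 + 80)  # no local repo index

print(local / 2**30)   # ~6.5 GiB
print(remote / 2**30)  # ~4.9 GiB (about 5.3 decimal GB)
```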

mathbr commented 9 years ago

Well, there were indeed a few apps running in parallel, with most of the memory claimed by Chromium and Plex Media Server; everything else is rather lightweight (running Xfce as desktop).

My swap is at 2 GB, which is not much, but with 8 GB I actually shouldn't need it at all. ;-)

ThomasWaldmann commented 9 years ago

about mmap: see https://github.com/jborg/attic/commit/2f72b9f96001310ca4a81f8336545f2a3dd1de04

mathbr commented 9 years ago

Has anyone tried again with that latest change yet? I'd like to know in advance how this fares before giving it another try. ;-) Just noticed that this change was from July 2014, nevermind.