sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Bounded resource usage (feature) #109

Closed vvs- closed 9 years ago

vvs- commented 9 years ago

Consider this use case. I'm really stuck here trying to deduplicate over 5 million files. The memory usage goes far beyond the available RAM, and using a large swap won't help because processing takes eons. I think that using some database, e.g. sqlite, to hold the file metadata would be a lifesaver. This is similar to #25 but with a different scope: to limit memory usage and make rmlint scale to a huge number of files.

sahib commented 9 years ago

Phew, that's a hard case. Actually, we are one of the more memory-conserving tools out there... Implementing a sqlite cache will probably never happen, but maybe there's another option.

First of all: you did not mention how you run rmlint. Some options like -D or -ppp, or some of the outputs (try disabling all of them!), might be quite memory expensive. Secondly: I'm not quite sure yet whether it's just the metadata causing the footprint. Also, databases do not guarantee relief in this case...

How much RAM do you have installed, and at what point (before traversal finishes/after/some other point) does it exceed the limit? A usual RmFile struct costs about 300 bytes, so with 5 million files this sums up to about 1.4GB (5,000,000 × ~300 bytes, plus some buffers). Which is large, but still somewhat reasonable in that case.

You're still using the develop version?

vvs- commented 9 years ago

I use the paranoid file compare, but that's the point - I don't completely trust hashes in such a case, only as a candidate speed-up. And the files are rather small; their vast number is the culprit. There are huge Subversion repositories in there, I mean with full history, which make you appreciate git. I made backups at different times and then just copied them all over. Now there are many duplicates in there, so deduplicating would free a lot of disk space. It would be a trivial SQL query to find all duplicate files in the database, and it wouldn't take so much memory. Then it would only need to compare files pair by pair, which shouldn't take much memory either.

And yes, the footprint is 1.6GB of resident size and about 3GB of virtual size. Unfortunately, I have only 2GB of RAM and that feels very painful. Everything slows to a crawl and the system is almost unresponsive. I think it's the filesystem cache that gets hurt most, so constant thrashing/swapping considerably slowed down the comparison and the filesystem ops. And you never know it beforehand, so it was an unpleasant surprise when the OOM killer took out some system processes. Fortunately, it was not a server, or it would be another story. The memory footprint was big from the start, but it got stuck in the matching stage, where it had found only about 100000 duplicates after eleven hours. I killed it after that; as you can guess, it wouldn't have finished even by tomorrow.

Yes, that's on develop branch.

sahib commented 9 years ago

By the way rmlint works, it needs to know most of the metadata at the same time anyway - or at least it cannot predict when it will need it. So even with a cache the same situation might happen. Therefore implementing a SQLite cache is neither trivial nor would I expect noticeable improvements from it.

The memory footprint is usually highest shortly after traversal and decreases after that, since we get rid of unused structs once found (except for -D) - so it might actually have gotten better after a while. You could even try to change the I/O scheduler or drop the caches from time to time.
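
For reference, dropping the caches on Linux needs root; one common way (flushes the page cache plus dentries and inodes, so the next run starts cold):

    $ sudo sync
    $ echo 3 | sudo tee /proc/sys/vm/drop_caches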

Regarding paranoid: the amount of memory used by the paranoid mode is managed and by default capped at around 256MB (see --max-paranoid-size). So paranoia might even be a good idea in your case. You could try a normal hash function though, just for the sake of comparing the memory footprint.

From my point of view, this is a general problem in the sense of "large data sets yield large resource usage". There is surely some potential to reduce the overall memory usage by compressing the paths stored in the RmFile structs (since they share common prefixes). But at the end of the day the amount of memory used by rmlint will grow with the number of files.

vvs- commented 9 years ago

You have a point there. The paths really are long for many files; maybe compressing them would help. Also, rmlint is not a black box - it's a tool with many options. I would consider using it to build a candidate list and then processing that list with other options if it would save me some memory. And I think that running an SQL query each time to find the next candidates would do just that. After all, they do a good job at optimizing those.

sahib commented 9 years ago

Maybe I'll take a stab at that once I have the time. In the meantime, some statistical questions:

  1. How many paths are in your dataset? (You mentioned 5 million; how exact is that?)

    $ find /absolute/path -type f | wc -l

  2. How long is one path on average?

    You can get the sum via:

    $ find /absolute/path -type f | python -c 'import sys; print(sum(len(l) for l in sys.stdin))'

    The result needs to be divided by the total number of files.

  3. How many of them have unique sizes?

    This should be printed as part of rmlint's -vvv output (List build finished...), or it can be checked with a bash one-liner like the one below.
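
    For example, assuming GNU find (counts files whose size occurs exactly once):

    $ find /absolute/path -type f -printf '%s\n' | sort -n | uniq -u | wc -l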

vvs- commented 9 years ago

Total files: 5389618
Total directory nodes: 79144
Directory leaves: 52261
Average path length: 93.633236
Files with unique sizes: 5111410 (?)

That last one is the number at the end of "List build finished" - or did you mean "Discarding unique sizes"? Then it's "removed 36683 of 2635993", whatever that means.

BTW, I found that there are drastic differences in processing time between hot and cold memory caches. It took rmlint 37 minutes to build the list on a hot cache and 55 minutes when the cache was cold. So the memory footprint makes a very big difference.

sahib commented 9 years ago

Thanks for the numbers. The last number was just to verify how many paths end up in the actual duplicate-finding phase.

Well, it's no news that the build goes faster on consecutive runs. On btrfs this can be quite extreme; I saw a factor of around 20x between the first and second run. This has (almost) nothing to do with the memory footprint though. find(1), for instance, shows the exact same behavior.

vvs- commented 9 years ago

Actually, it's the second run that was slower. The first one ran right after find and used the directory tree cached in memory. The huge footprint means the memory cache always goes cold after a certain point.

sahib commented 9 years ago

I added two new options:

Here's a short test with 1 million files:

http://i.imgur.com/Ck9FrFG.png

Caveats: It does not work for paths piped into rmlint via find yet. Also use with care in general.

vvs- commented 9 years ago

It looks promising!

I found one problem. Try this: mkdir 1; cd 1; truncate -s 6 0000; seq 1 8191 | xargs printf '%04X\n' | xargs -n 1 ln 0000. It slows rmlint 1 --with-metadata-cache down significantly. Without the cache it works fast enough.

Also, I already noticed that it doesn't work with find.

I have a question though. When used with the empty-directory option it reads more than 700 MB per minute from disk, while find read only 520 MB in total. Is that normal? I wouldn't expect it to read more than find does just to look for empty directories.

sahib commented 9 years ago

I use SQLite's rowid, which should be always indexed, even without explicitly defining one:

https://www.sqlite.org/queryplanner.html
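
For illustration, a minimal sqlite3 session (the table and column names here are made up, not rmlint's actual schema) - a plain rowid lookup already goes through the table's b-tree:

    $ sqlite3 /tmp/test.db "CREATE TABLE paths(path TEXT);"
    $ sqlite3 /tmp/test.db "INSERT INTO paths(path) VALUES('/some/dir/file');"
    $ sqlite3 /tmp/test.db "SELECT path FROM paths WHERE rowid = 1;"
    /some/dir/file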

The slowness is due to the massive number of selects the current pre-processing emits. When the paths were still in memory this was no problem, but now every path is looked up several times for every hardlink it has (in your testcase: a lot).

Support for find should be in now too...

To your question: rmlint is not find. Why should the numbers be comparable?

vvs- commented 9 years ago

That's unfortunate, because I have similar directories among my data. I expected rmlint not to look for hardlinks if it was told not to. And when looking for empty directories, I assumed it would traverse file names only, without additional processing. It is much faster to just link them again. How can I turn off hardlink processing in that case? rmlint -L doesn't help either.

Another possibility would be storing the inode number in the database and querying by it (indexed).

find doesn't work with the progressbar for some reason.

BTW, why not make it another formatter for consistency and functionality?

sahib commented 9 years ago

rmlint first has to know whether a file is a hardlink before it can ignore it. -L won't do anything here, of course. The solution is rather to fix the relevant code, not to work around it.

@BTW: What is "it"? The progressbar? That is already a formatter. Please be a bit more specific when writing answers, especially by including the command you ran. It sometimes takes me way too much time to figure out what you mean.

Edit: Problem with find and progressbar confirmed.

vvs- commented 9 years ago

Sorry, I didn't realize it was ambiguous. English is not my native language. I meant that making the sqlite cache a formatter instead of a special option would be more consistent with the other functionality and would allow specifying some parameters, e.g. a file name. It would also allow making the cache persistent, as it is with JSON.

It really makes no sense to ignore hardlinks if it takes so much time to find them. I'd rather process them as ordinary files instead; the time it takes to compare them would be much smaller in this case. If I understand you correctly, there is no option for that. And why doesn't it compare just the inode numbers, which are already in memory, without querying the database? Also, it seems to me that storing the inode numbers in the database along with the paths would help to optimize this case: it would then be possible to query just the hardlinked files instead of all paths. And if it were a formatter, it would be the same kind of logic as with JSON now.

sahib commented 9 years ago

No problem, just a (more or less) friendly reminder.

Making the cache a formatter is a valid discussion point, but I chose against it since I don't want people to rely on the cache internals. It's simply meant as a memory optimization that should be bound to a single process. It's true that a separate sqlite formatter with more information in it might be useful for some people, but I don't intend to write one due to my limited time (patches welcome though).

Regarding hardlinks: finding files with identical inode and device id is indeed not hard. What's slow in the current code is finding path doubles, i.e. files that are exactly the same (same inode, same device id, same parent inode and parent device id) because they were given twice. Example: rmlint /tmp /tmp should take only one copy of each file. I already rewrote that part, but it's rather tricky logic and needs a bit of local testing since it's easy to screw up badly. Previously, path doubles were additionally recognized by having the same basename, which is what actually caused the many database queries. But the code was slightly weird in general, hence the rewrite.

sahib commented 9 years ago

The progressbar bug should be fixed by the way.

vvs- commented 9 years ago

Thanks!

sahib commented 9 years ago

The part I mentioned is rewritten now. I updated the sqlite-hash-table branch - it surely needs some testing before it goes into develop.

vvs- commented 9 years ago

Thank you!

But something is not working right, I'm afraid. I did the same test mkdir 1; cd 1; truncate -s 6 0000; seq 1 8191 | xargs printf '%04X\n' | xargs -n 1 ln 0000 and added another file plus a copy of it. The command rmlint 1 -T df -pp --without-fiemap --with-metadata-cache -o progressbar produced very weird results. I got a Hardlink file size changed during traversal warning for every non-hardlink there, and no duplicates were found. Also, it claims that there are 0 usable files in the traversing stage and 8194 after preprocessing. Matching sees 1 file of 16384 GB. There might be another 32-bit quirk, or at least it looks similar.

sahib commented 9 years ago

D'oh - Sorry. Technically, it's the same issue as the one you reported. I just tried to delete some duplicate code lines and forgot to test them. Should be fixed..

vvs- commented 9 years ago

Excellent news!

I think that the code is perfect for my purposes. I'll try to deduplicate my data with it. This will take a while. I'll report any bugs that I may find.

But for now I'm opening another bug report about the progress bar.

vvs- commented 9 years ago

While rmlint is still matching my data, I noticed that the last line of the progress bar gets cut off. I think the word files in Matching files could be omitted to save space. Also, there is an extra space after scan in there.

SeeSpotRun commented 9 years ago

I'd be curious to see how https://github.com/SeeSpotRun/rmlint/tree/dir_tree compares for your use case too. It's a quick and dirty implementation of a different way of storing the file paths, using an n-ary tree. This effectively de-duplicates the common elements of path strings, so if we already have /very/very/long/dir/path/file1 then adding /very/very/long/dir/path/file2 only requires 9 additional bytes (Edit: to store the path, that is... there are still around 160 bytes or so for the rest of the RmFile structure, plus the Fiemap structure for each file, which I guess is around 40 bytes per file fragment).

vvs- commented 9 years ago

After running for 28 hours it crashed with gmem.c:103: failed to allocate 4194304 bytes. It had fewer than 4000 files remaining to scan and I was expecting it to finish at any minute. The memory use was modest at all times, no more than 1.12 GB, but slightly increasing towards the end. No swapping happened and the system was available at all times. I noticed that it was mostly CPU bound from the start, but towards the end it became mostly I/O bound, with read rates rising from 2-3 MB/s to 50-60 MB/s and processing slowing to a few files per minute. The last progress report shows 2831619 dupes out of 1089222 originals. I'm investigating the possible cause.

@SeeSpotRun Thanks. That's certainly interesting and I'll experiment with it a bit later, after I find a possible workaround for the crash.

sahib commented 9 years ago

It's hard to guess what caused this, since it's a problem in an external library. We do not check the return value of malloc, since there is not much we can do if it fails except report that it failed. My best guess is that it's a 32-bit integer problem in some way again.

One possible idea: Can you try to run it on a few files larger than 4GB? Or even much larger than that?

vvs- commented 9 years ago

In that case I'll recompile it with debug information and allow it to produce a core dump. Then it should be possible to analyze it in gdb to determine the exact place and conditions. But there is another problem which actually prevents me from doing that: I tried to limit the number of possible candidates with the -b parameter, and it caused the same high CPU utilization as with the hardlinks. I guess you dispose of the hash table at the end of traversing? Could you change that so the matching stage benefits from it as well?

Yet another idea for optimization: the traversing stage took 51 minutes, but find can traverse all the files, reading their link counts and sizes, in only 20 minutes. It seems to me that with just that same information it should be possible to find all hardlinks, provided every file can only be reached via a single path, i.e. by not following symlinks. Would that reduce the traversal time? What do you think?
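
Just to illustrate what I mean (GNU find; this uses only directory metadata - device, inode, link count, size - and reads no file contents):

    $ find /absolute/path -type f -links +1 -printf '%D %i %n %s %p\n' | sort -n

Entries sharing the same device and inode end up next to each other, so hardlink groups are easy to spot.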

sahib commented 9 years ago

A core file might help, but no guarantees. What was the exact command you ran? It is especially interesting whether you used the paranoia mode.

--basename has to read the basename of each file. It might be possible to optimize that by caching the basename once read, but that will be done after merging with @SeeSpotRun's version. Afterwards the default rmlint will use less memory too, but will still have the option to cut the usage more aggressively. Once that's done we can improve the basename problem. By the way, this problem is completely separate from the hash table you mentioned - that one was for identifying path doubles and hardlinks.

Regarding find: again, rmlint is not find. We do quite a bit more in our traversal stage. There are more things to watch out for than symlinks and hardlinks, like bind mounts and non-regular files. We also need to check for other lint, if you didn't disable that. And instead of just printing the information, we need to build data structures for later - all in all, a lot more information is needed. Furthermore: finding hardlinks is not the part that eats resources.

Btw.: did you drop the fs caches between obtaining those numbers?

vvs- commented 9 years ago

The command was rmlint dir -T df -b -pp --without-fiemap --with-metadata-cache -o progressbar -o summary -o sh:rmlint.sh -c sh:hardlink -VVV

The version by @SeeSpotRun has the CPU utilization problem too but in a different context.

Well, I understand that rmlint is not find. But it's still useful to compare some timings. I find it a bit frustrating that it took 2.5 times as long as just reading the relevant information from disk. When one only needs to find duplicates, there may be some processing that could be omitted to save time.

I didn't drop caches because I wanted to see if running rmlint after find would improve its' timing as well. It didn't help though.

vvs- commented 9 years ago

Here's the cause. Run mkdir 1; cd 1; seq 347 | xargs -n1 truncate -s 100m; seq 347 | while read n; do echo -n $n | dd conv=notrunc of=$n 2>/dev/null; done; cd ..; cp -a --sparse=always 1 2. Now run rmlint 1 2 -T df -pp --without-fiemap --with-metadata-cache -o progressbar -VVV. It's just a matter of time until the virtual memory gets exhausted.

SeeSpotRun commented 9 years ago

That's a bit of a corner case: 700 large files, all of the same size. The memory manager for paranoid hashing doesn't currently contemplate that one. A quick fix at https://github.com/SeeSpotRun/rmlint/commit/57779832db9171c7ff17792a406a41542e1d348a should need about 1.4G to handle the 700 files.

vvs- commented 9 years ago

Not at all. These are just (not so) big multi-volume archives with the same volume sizes and several copies. That's why rmlint is needed, isn't it? The actual volume size is half of that, but I couldn't make a test case that accounts for the other working memory, so I tweaked it slightly to clearly demonstrate the issue on a smaller subset.

sahib commented 9 years ago

Well, I understand that rmlint is not find. But it's still useful to compare some timings. I find it a bit frustrating that it took 2.5 times as long as just reading the relevant information from disk. When one only needs to find duplicates, there may be some processing that could be omitted to save time.

A lot of data structures are built during traversal too, and that goes into the time needed for "traversing". Comparing those numbers simply makes no sense. It's true that optimizations are always possible, but they always come at the price of more code (and thus more points of failure) and more work for us.

Not at all. These are just (not so) big multi-volume archives with the same volume sizes and several copies. That's why rmlint is needed, isn't it?

Yes. It's true that this configuration might appear in the wild.

But your setup is definitely a corner case, since you are on highly limited RAM - especially when running on millions of files in paranoid mode. You cannot kill every problem with software alone.

We appreciate your constructive criticism and test cases, but we're not a request show.

vvs- commented 9 years ago

Sorry, I'm not requesting anything. I'm just pointing out real issues, IMHO. You should decide what to do in each case. I can work around every single case by just excluding it from deduplication.

But your setup is definitely a corner case, since you are on highly limited RAM - especially when running on millions of files in paranoid mode. You cannot kill every problem with software alone.

But the average user has no clue that such a data set is very demanding on resources. Actually, it only became obvious during the process of deduplicating and bug hunting. So you should expect such a setup to be the norm for people with very old backups.

Allow me to elaborate on this a little. Memory is expensive. My motherboard supports only up to 4 GB, so upgrading would mean replacing the motherboard, CPU, memory and maybe even some peripheral devices. And even if I did, the old system would likely become a NAS box for my backups. But disks are cheap, so I'd rather buy an additional disk and copy data to it. Servers are likely to be attached to a SAN with built-in online deduplication. That's why I believe that offline deduplication is mostly for low-memory setups and won't be in high demand on premium systems.

BTW, rmlint -b -pp uses a tiny fraction of the memory to process this testcase, but at present it doesn't cope well with a big number of files.

SeeSpotRun commented 9 years ago

All good points. It's certainly not such a corner case as I thought - I hadn't considered multi-volume archives always being split into same-sized chunks. I'm re-thinking / re-working the memory manager for paranoid hashing in https://github.com/SeeSpotRun/rmlint/tree/mem_miser. I'm not sure if the current version there covers all cases but it's getting closer.

vvs- commented 9 years ago

Thank you for all your hard work!

Your work will certainly benefit the product and all those who want to use it.

SeeSpotRun commented 9 years ago

No worries; the improved mem manager is much better than the old one. Unfortunately we lost a fair bit of speed (not from the mem manager but from the n-ary tree used to reduce the RAM required to hold all the file paths). I still have a couple of things to try that will hopefully speed that up.

sahib commented 9 years ago

@vvs: Nice to hear it ran through.

The --with-metadata-cache should be working again. The memory savings will not be as big as before, since the default implementation now uses much less memory thanks to @SeeSpotRun. Out of curiosity (and if you did not delete all that data yet): how does the timing/memory footprint of --with-metadata-cache compare to the normal run you posted?

vvs- commented 9 years ago

The memory savings will not be as big as before

It doesn't matter, as long as it doesn't crash or cause other nasty effects when memory is low.

Out of curiosity (and if you did not delete all that data yet): how does the timing/memory footprint of --with-metadata-cache compare to the normal run you posted?

I plan to run it again with other optimization options. I'll post the results here.

Thanks again for all work on this!

vvs- commented 9 years ago

Using --with-metadata-cache, rmlint did the same run in 11h 28m, i.e. about 17% slower. It used about 17% less memory (a coincidence?). But this comparison is not entirely fair, because some additional fixes were made to rmlint between the runs.

sahib commented 9 years ago

Heh, no coincidence I believe. The numbers should be accurate enough for a rule of thumb. Okay, that's enough to justify the continued existence of --with-metadata-cache, I guess.

For future reference, part of the discussion here also happened in https://github.com/SeeSpotRun/rmlint/issues/1. Is there anything still open or can I close this issue?

vvs- commented 9 years ago

I believe that the original issue was fully addressed.