markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

hashfile silently not updated, new files not scanned if a file in hashfile has shrunk below blocksize #139

Closed: olifre closed this issue 8 years ago

olifre commented 8 years ago

If a file which is in the hashfile has shrunk below the blocksize since the last scan, any database update (and also deduplication) will silently be skipped.

Here's what's going on:

  1. add_file_db is the function which does all the magic, calling __add_file for each file.
  2. __add_file has the following part: https://github.com/markfasheh/duperemove/blob/master/file_scan.c#L208
  3. If that branch is taken, __add_file will return 0, but leave the '**ret_file' parameter untouched (as expected).
  4. Now, back in add_file_db, we have: https://github.com/markfasheh/duperemove/blob/master/file_scan.c#L431
  5. So whenever we get a "Skipping small file" for any file from the DB, 'add_file_db' will falsely interpret the untouched file parameter as "ENOMEM".
  6. add_file_db will thus return "ENOMEM" without producing any error message(!), and the DB update and deduping will silently be skipped. (See the sketch after this list.)
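
To make this concrete, here is a minimal standalone sketch of the pattern (simplified, hypothetical names like add_one_file; not the actual duperemove code):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct filerec {
	const char *filename;
};

/* Stand-in for __add_file: returns 0 ("no error") when it skips a small
 * file, but never touches *ret_file in that case. */
static int add_one_file(const char *name, unsigned long size,
			unsigned long blocksize, struct filerec **ret_file)
{
	if (size < blocksize) {
		printf("Skipping small file %s\n", name);
		return 0;	/* success, but *ret_file stays untouched */
	}
	*ret_file = malloc(sizeof(**ret_file));
	if (!*ret_file)
		return ENOMEM;
	(*ret_file)->filename = name;
	return 0;
}

int main(void)
{
	struct filerec *file = NULL;
	/* Caller pattern resembling add_file_db: a file that shrank below
	 * the blocksize since the last scan gets skipped... */
	int ret = add_one_file("/tmp/shrunk-file", 512, 131072, &file);

	/* ...and "ret == 0 but file == NULL" is then taken to mean that
	 * the allocation failed, so the run silently bails out. */
	if (ret == 0 && file == NULL)
		ret = ENOMEM;

	return ret;
}

The interesting part in the real code is of course that the skipped file is already recorded in the hashfile, so the whole DB update and dedupe phase bails out.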
markfasheh commented 8 years ago

Bingo, thanks for the excellent analysis. I'll work on a fix.

markfasheh commented 8 years ago

Alright, my testing shows that commit cc72c3f should fix this. Would you mind giving latest master branch a test please?

olifre commented 8 years ago

Yep, that fixes it! However, I am now getting a core dump a bit later (not sure whether this problem is related...)

Total files:  21
Total hashes: 8420
Loading only duplicated hashes from hashfile.

At first glance, this seems to originate from the "hashes" table containing a hash for an ino / subvol pair which is not inside the "files" table (anymore)... which then leads to dbfile_load_one_filerec calling filerec_new with the argument filename = NULL. I do not see the code path leading to this DB state, so I am unsure whether the issue is really related.

olifre commented 8 years ago

I sadly did not manage to produce a simple reproducer, but I have some more info about this crash using --debug with a debug build:

Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" from hashdb
Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js" from hashdb
Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/cookies.sqlite-wal" from hashdb
Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/webappsstore.sqlite-wal" from hashdb
Lookup/stat file "/home/olifre/.xsession-errors" from hashdb
Remove file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" from the db
Using 8 threads for file hashing phase
csum: /home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js   [1/5] (20.00%)
csum: /home/olifre/.mozilla/firefox/abyu7248.default/places.sqlite-wal  [2/5] (40.00%)
csum: /home/olifre/.mozilla/firefox/abyu7248.default/cookies.sqlite-wal         [3/5] (60.00%)
csum: /home/olifre/.cache/plasma_theme_breeze-dark_v5.19.kcache         [4/5] (80.00%)
csum: /home/olifre/.xsession-errors     [5/5] (100.00%)
Total files:  5
Total hashes: 159
File "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" still needs update in db
Loading only duplicated hashes from hashfile.
ASAN:SIGSEGV
=================================================================
==27311==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f5129b815da sp 0x7ffd81a5bf48 bp 0x000000000000 T0)
    #0 0x7f5129b815d9 in strlen (/lib64/libc.so.6+0x815d9)
    #1 0x7f5129b812fd in __strdup (/lib64/libc.so.6+0x812fd)
    #2 0x410f48 in filerec_alloc_insert /home/olifre/gits/duperemove/filerec.c:262
    #3 0x410f48 in filerec_new /home/olifre/gits/duperemove/filerec.c:290
    #4 0x42016e in dbfile_load_one_filerec /home/olifre/gits/duperemove/dbfile.c:1115
    #5 0x42016e in dbfile_load_hashes /home/olifre/gits/duperemove/dbfile.c:1164
    #6 0x405f3f in main /home/olifre/gits/duperemove/duperemove.c:729
    #7 0x7f5129b2062f in __libc_start_main (/lib64/libc.so.6+0x2062f)
    #8 0x406e68 in _start (/home/olifre/gits/duperemove/duperemove+0x406e68)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 strlen
==27311==ABORTING
markfasheh commented 8 years ago

Awesome, thanks. So it's clear the bug is that we're trying to create a filerec for a file which does not exist in the db. We could avoid this pretty easily by just skipping those hashes that do not have a corresponding file record.

The question in my mind is how we got into the situation where there's hashes in the db but no file record for those hashes. IMHO the real fix is to not get into that situation (or if it's unavoidable then we can document it and clean up the DB).
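
If someone needs to clean an affected hashfile by hand in the meantime, something along these lines should drop the orphaned rows. This is only a standalone sketch based on the column names visible in your queries, not duperemove code, so back up the hashfile first (build with something like: gcc clean_orphans.c -lsqlite3):

#include <stdio.h>
#include <sqlite3.h>

int main(int argc, char **argv)
{
	sqlite3 *db;
	char *err = NULL;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <hashfile>\n", argv[0]);
		return 1;
	}
	if (sqlite3_open(argv[1], &db) != SQLITE_OK) {
		fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
		sqlite3_close(db);
		return 1;
	}

	/* delete hashes whose (ino, subvol) pair has no files row */
	const char *sql =
		"DELETE FROM hashes WHERE NOT EXISTS "
		"(SELECT 1 FROM files WHERE files.ino = hashes.ino "
		" AND files.subvol = hashes.subvol);";

	if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
		fprintf(stderr, "delete failed: %s\n", err);
		sqlite3_free(err);
	} else {
		printf("Removed %d orphaned hash rows\n", sqlite3_changes(db));
	}

	sqlite3_close(db);
	return 0;
}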

olifre commented 8 years ago

The question in my mind is how we got into the situation where there's hashes in the db but no file record for those hashes. IMHO the real fix is to not get into that situation (or if it's unavoidable then we can document it and clean up the DB).

Yes - my problem is that I do not see how the code could ever end up in this situation... So far I have only been able to reproduce this with my root volume, which has quite a bunch of files on it, so I presume it's related to some kind of ugly race condition. Checking the DB, I get:

sqlite> select ino,subvol,loff,flags, (select count() from files where files.ino=hashes.ino and files.subvol=hashes.subvol) as filenum from  hashes where filenum=0;                                                                         
2380655|258|0|0|0
2380655|258|131072|0|0
2380655|258|262144|0|0
2380655|258|393216|0|0
2380655|258|524288|0|0
2380655|258|655360|0|0
2380655|258|786432|8|0
2380649|258|0|0|0
2380649|258|131072|0|0
2380649|258|262144|0|0
2380649|258|393216|0|0
2380649|258|524288|0|0
2380649|258|655360|0|0
2380649|258|786432|8|0
2380904|258|0|0|0
2380904|258|131072|0|0
2380904|258|262144|0|0
2380904|258|393216|0|0
2380904|258|524288|0|0
2380904|258|655360|0|0
2380904|258|786432|8|0

These three inode / subvol pairs do not show up in an older version of my dedupe-db, sadly. Checking them on my system, I get:

# btrfs inspect-internal inode-resolve 2380655 /
//var/lib/layman/mv/.git/objects/23/5de0492a59fed3a73359fa35409fce95edf1e0

The other two inodes are also blobs from that git repository. Stat'ing these blobs, they are also not really new (they are from April this year, while the duperemove DB is from yesterday).

Checking the --debug output, I find the line:

Skipping small file /var/lib/layman/mv/.git/objects/23/5de0492a59fed3a73359fa35409fce95edf1e0

Since the size of this blob has surely not changed, I have no idea why it should ever have entered the DB, so I am at a loss here :(.

markfasheh commented 8 years ago

Yeah when we write the hashes, the filerec goes first, and when we remove from the db we do it in the opposite direction.
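
Roughly, the intended ordering is the following (a toy standalone sketch with a made-up minimal schema, not our actual write path), so an interruption should at worst leave a file without hashes, never hashes without a file:

#include <stdio.h>
#include <sqlite3.h>

static int run(sqlite3 *db, const char *sql)
{
	char *err = NULL;
	int ret = sqlite3_exec(db, sql, NULL, NULL, &err);

	if (ret != SQLITE_OK) {
		fprintf(stderr, "%s\n", err);
		sqlite3_free(err);
	}
	return ret;
}

int main(void)
{
	sqlite3 *db;

	sqlite3_open(":memory:", &db);
	run(db, "CREATE TABLE files (filename TEXT, ino INT, subvol INT);");
	run(db, "CREATE TABLE hashes (digest BLOB, ino INT, subvol INT, loff INT, flags INT);");

	/* write: the files row goes in first, then its hashes */
	run(db, "INSERT INTO files VALUES ('/tmp/f', 42, 258);");
	run(db, "INSERT INTO hashes VALUES (x'00', 42, 258, 0, 0);");

	/* remove: the hashes go away first, then the files row */
	run(db, "DELETE FROM hashes WHERE ino = 42 AND subvol = 258;");
	run(db, "DELETE FROM files WHERE ino = 42 AND subvol = 258;");

	sqlite3_close(db);
	return 0;
}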

That the file is too small for hashing but you have 7 hashes for it is also weird. On disk it's a small file but according to the db we've scanned about 900K. I almost want to say it's some bug with figuring out renames but I doubt very much that file was renamed and truncated recently.

markfasheh commented 8 years ago

Another thought - why is sqlite giving us a NULL pointer when filename is a NOT NULL field. Does this all reproduce if you delete the hashfile and start with a fresh one?

markfasheh commented 8 years ago

Sorry for the rapid fire updates, here's one more thing to try:

echo "pragma integrity_check;" | sqlite3 path-to-your-hashfile

should invoke the sqlite3 integrity checker. The documentation says it will print errors to stdout, so you could paste any of those here.

olifre commented 8 years ago

Sorry for the rapid fire updates, here's one more thing to try:

No problem! The integrity checker just returned 'ok', nothing else...

Another thought - why is sqlite giving us a NULL pointer when filename is a NOT NULL field.

It's not, it's not giving back a row ;). Patching https://github.com/markfasheh/duperemove/blob/f533555172a6756c7f1850b1f17e6872af659287/dbfile.c#L1100 to read:

        if (ret == SQLITE_ROW) {
                /* print a diagnostic message here? */
                ret = SQLITE_DONE;
        } else {
                fprintf(stderr, "GOT NO ROW\n");
        }

I get

Loading only duplicated hashes from hashfile.
GOT NO ROW
ASAN:SIGSEGV
=================================================================
==20903==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f39a01a05da sp 0x7ffe9ecd4ed8 bp 0x000000000000 T0)

with the same backtrace, since the API returned "SQLITE_DONE", which is also accepted here... So the "NULL" is just undefined (or maybe rather, friendly / optimized) behaviour of the sqlite3 API when there is no result row but one still tries to access column values of a "return row" which does not exist.
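
For illustration, here is a tiny standalone program (not duperemove code) showing that behaviour against an empty in-memory table:

#include <stdio.h>
#include <sqlite3.h>

/* Stepping a query that matches no rows yields SQLITE_DONE; reading a
 * column anyway gives back NULL in practice -- which is what ended up
 * in strdup() via filerec_new() above. */
int main(void)
{
	sqlite3 *db;
	sqlite3_stmt *stmt;

	sqlite3_open(":memory:", &db);
	sqlite3_exec(db, "CREATE TABLE files (filename TEXT NOT NULL, ino INT);",
		     NULL, NULL, NULL);
	sqlite3_prepare_v2(db, "SELECT filename FROM files WHERE ino = 42;",
			   -1, &stmt, NULL);

	int ret = sqlite3_step(stmt);
	if (ret == SQLITE_ROW) {
		printf("filename: %s\n", sqlite3_column_text(stmt, 0));
	} else {
		/* ret == SQLITE_DONE: no row; the column value is NULL here */
		printf("no row, column gives %p\n",
		       (void *)sqlite3_column_text(stmt, 0));
	}

	sqlite3_finalize(stmt);
	sqlite3_close(db);
	return 0;
}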

Does this all reproduce if you delete the hashfile and start with a fresh one?

I'll try that tomorrow and report back then (scanning the full volume takes quite some time). I had already experienced a segfault after "Loading only duplicated hashes from hashfile." once before, but was unsure whether that was because the DB was old (from before many of the recent changes), so I deleted the DB, and only after seeing it again did I start to look deeper into this issue. I'll start over with a fresh DB tomorrow and check the details when the issue comes back (e.g. whether suddenly different inos are affected). Just to make sure, I'll also scrub the volume.

markfasheh commented 8 years ago

I fixed up the error handling in dbfile_load_one_filerec(); it'll do essentially what you're doing now anyway (which is to catch when we don't get SQLITE_ROW and print an error message). Thanks for everything you've done so far.

olifre commented 8 years ago

I just had another idea before falling asleep and finally calling it a day:

sqlite> select hex(digest),ino,subvol,loff,flags, (select count() from files where files.ino=hashes.ino and files.subvol=hashes.subvol) as filenum from  hashes where filenum=0;                                                             
7D299AD79FDC2D95149D38D1F9F08CE5|2380655|258|0|0|0
1C77CB2FC7AB96F3B34D9BC783338051|2380655|258|131072|0|0
B57333D24F18BC9605D3463FCB639A7A|2380655|258|262144|0|0
D0CBE7EC5E68834ED8999C05668CC0F7|2380655|258|393216|0|0
3BF039FED55A262BBC6889BAF092CA64|2380655|258|524288|0|0
862DA306869BDC567614E98D0F09BFE0|2380655|258|655360|0|0
8C647539B16E97E6314FBAFA028D971C|2380655|258|786432|8|0
7D299AD79FDC2D95149D38D1F9F08CE5|2380649|258|0|0|0
1C77CB2FC7AB96F3B34D9BC783338051|2380649|258|131072|0|0
B57333D24F18BC9605D3463FCB639A7A|2380649|258|262144|0|0
D0CBE7EC5E68834ED8999C05668CC0F7|2380649|258|393216|0|0
3BF039FED55A262BBC6889BAF092CA64|2380649|258|524288|0|0
A0822A257FD2DEE41706B396CDD772FA|2380649|258|655360|0|0
6A9E1D7D966CF3099E826DA9BEE1F6B2|2380649|258|786432|8|0
7D299AD79FDC2D95149D38D1F9F08CE5|2380904|258|0|0|0
1C77CB2FC7AB96F3B34D9BC783338051|2380904|258|131072|0|0
B57333D24F18BC9605D3463FCB639A7A|2380904|258|262144|0|0
D0CBE7EC5E68834ED8999C05668CC0F7|2380904|258|393216|0|0
3BF039FED55A262BBC6889BAF092CA64|2380904|258|524288|0|0
432F27F886147945948E790130DEDA34|2380904|258|655360|0|0
DF8A9AE1255AFEA099C030B23FA2F9F3|2380904|258|786432|8|0

These are the orphaned hashes... For this DB, I made a backup right after the initial duperemove run, before doing all the later ones. That initial DB does not (yet) show any of these orphaned hashes. Checking against it, I find:

sqlite> select hex(digest),ino,subvol,loff,flags from hashes where hex(digest)="7D299AD79FDC2D95149D38D1F9F08CE5";
7D299AD79FDC2D95149D38D1F9F08CE5|2379927|258|0|0
7D299AD79FDC2D95149D38D1F9F08CE5|2379933|258|0|0
sqlite> select * from files where ino=2379927;
/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak|2379927|258|797861|7|1471592612539708817|8
sqlite> select * from files where ino=2379933;
/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js|2379933|258|797861|7|1471592627624661288|8

That's much more realistic: these firefox profile files are always a moving target, likely to trigger some kind of race condition - and they are also more likely to have reached >700 kB...

I'll try again tomorrow with a completely fresh DB and keep firefox running while doing "incremental" duperemove runs. Doing a DB backup after each duperemove run until the issue is triggered may shed more light on what's going on.

Thanks for everything you've done so far.

No problem, thanks to you for this cool tool and all your work on it!

olifre commented 8 years ago

Reproduced starting from a fresh DB. I did the following:

duperemove --hashfile=dedupe.sqlite -dr / --debug > iter0.log 2> iter0.err
cp dedupe.sqlite dedupe_iter0.sqlite
duperemove --hashfile=dedupe.sqlite -dr / --debug > iter1.log 2> iter1.err
cp dedupe.sqlite dedupe_iter1.sqlite
duperemove --hashfile=dedupe.sqlite -dr / --debug > iter2.log 2> iter2.err
cp dedupe.sqlite dedupe_iter2.sqlite
duperemove --hashfile=dedupe.sqlite -dr / --debug > iter3.log 2> iter3.err
cp dedupe.sqlite dedupe_iter3.sqlite

This last run, I got:

$ cat iter3.err 
dbfile_load_one_filerec(): Database error 101 while executing statement: unknown error
Error loading filerec (2435954,258) from db

And indeed I find lost hashes which can still be correlated to recovery.bak / recovery.js of my firefox profile in the older database copies.

So I presume something around this happens:

I also retried a second time today with a fresh DB. This time, already the second run of duperemove produced a DB with dangling hashes, again, the famous recovery.js and recovery.bak of firefox (where it remembers the session / open tabs to restore in case of crash, so it's rewritten often).

So it's clearly reproducible with a very frequently rewritten pair of partially equal files, but I am still unable to make out where this would happen in the code :(.

Maybe you can make more of the logs... I stored the output of:

grep -H sessionstore iter*.log | grep recovery

which are the STDOUT logfiles from my second test, here: http://pastebin.com/3VtK61GD . iter0 is from the very first duperemove run without a DB. iter1 is from an update run with the DB; the DB left behind after this run already contains the dangling hashes. iter2 is just another run, which then triggers the new error-message behaviour.

I'll keep the full logs and DB on my SSD for a bit in case you have a better idea on what to look for...

olifre commented 8 years ago

Looking at the pastebinned logs again, I am a bit curious about this:

iter1.log:(dedupe) info[0]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js", fd: 158, logical_offset: 786432, bytes_deduped: 1835008, status: 0
iter1.log:(dedupe) info[1]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js", fd: 161, logical_offset: 786432, bytes_deduped: 1835008, status: 0
iter1.log:(dedupe) info[2]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak", fd: 162, logical_offset: 786432, bytes_deduped: 0, status: -22

Why is recovery.js listed here twice? Could something ugly be happening, like the filerec being found once from the DB and once (with a different inode) in the regular scan?

Good luck with the logs!

markfasheh commented 8 years ago

Btw, I agree that this might be a problem with resolving the DB vs what was found in the regular scan. And unfortunately I don't really see where this could be happening in the code. The function to look at IMHO is add_file_db(). By the time that runs, we've already done the usual file scan so we're taking the filerec which we know to be up to date and comparing it against what's stored in the db.

I haven't been able to reproduce on my end yet, I'm even running firefox and having duperemove scan that directory in a loop.

olifre commented 8 years ago

I haven't been able to reproduce on my end yet, I'm even running firefox and having duperemove scan that directory in a loop.

I also tried that once without success... I guess it needs the additional delay of scanning a large directory / file tree (I could only reliably reproduce with my full root volume). Also, is your sessionstore (the recovery.js) large enough to warrant deduplication? I have >100 tabs open here (no time to clean up) which probably helps a lot in this regard...

Looking again at the strange output from my last post which lists recovery.js twice, with two different FDs... Could it have happened that FD 158 was opened for recovery.js, and while it was held open, firefox did its rotation: it renamed recovery.js to recovery.bak and placed a new recovery.js in the directory, which is the other FD 161? This would mean the "renamed version" of recovery.js and the new version were open in parallel, in our filerecs they have the same name, and we missed the rename since it was done after the fopen(). I still don't see where exactly that would mess up the code, though...

markfasheh commented 8 years ago

I found at least one bug with renamed files which may explain your duplicated file names - the fix is in git now (and I have a test for it in duperemove-tests):

https://github.com/markfasheh/duperemove/commit/8dc8e47be1dba9a2b52ac6325ae603089a732ea7

Feel free to test that, or wait until I fix the other issue I think I found.

I'm not sure if any of these exactly fix your problem but they're both in the area where we're looking so hopefully one or both of them will do the trick.

markfasheh commented 8 years ago

Btw, yeah recovery.js is big enough to be deduped.

So, introducing a sleep() into populate_tree() gives me the opportunity to play with the files in between stat() and the actual scans, so hopefully I can figure out something via that method. The only problem is that it won't be reproducible (which would be nice for testing).

olifre commented 8 years ago

I'm not sure if any of these exactly fix your problem but they're both in the area where we're looking so hopefully one or both of them will do the trick.

Sadly, I can still reproduce: already on the second run (i.e. the first run of duperemove with a hashdb), the dangling hashes show up in the DB.

Here's the output of a run with debug and verbose, grepping only for lines like:

grep -R recovery *.log | grep sessionst

http://pastebin.com/jFNKYnJe

Note that I now get (as expected):

iter1.log:Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js" from hashdb
iter1.log:File "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js", ino/subvol was 3065886.258 is now 3078868.258
iter1.log:Lookup/stat file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" from hashdb
iter1.log:File "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak", ino/subvol was 3065787.258 is now 3078736.258
iter1.log:Remove file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js" from the db
iter1.log:Remove file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" from the db

The dangling hashes in the DB produced by this "iteration 1" are for inos 3078088 and 3078190, though. These inos are not yet present in the DB before this run, and they don't show up in the debug output. So the dangling hashes are added during iteration 1 - it's not that they were present in the DB before and not removed.

Also, I still get:

iter1.log:(dedupe) info[0]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js", fd: 62, logical_offset: 786432, bytes_deduped: 2883584, status: 0
iter1.log:(dedupe) info[1]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js", fd: 63, logical_offset: 786432, bytes_deduped: 2883584, status: 0
iter1.log:(dedupe) info[2]: name: "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak", fd: 64, logical_offset: 786432, bytes_deduped: 0, status: -22

And before that even:

iter1.log:[0x61d00001c770] Add extent for file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" at offset 786432 (61)
iter1.log:[0x61d00001c770] Add extent for file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js" at offset 786432 (62)
iter1.log:[0x61d00001c770] Add extent for file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js" at offset 786432 (63)
iter1.log:[0x61d00001c770] Add extent for file "/home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak" at offset 786432 (64)

This somehow looks like these sessionstore files are picked up twice in the scanning, but I don't see how that would happen...

olifre commented 8 years ago

Sorry for my multiple posts, but I just have an idea. What prevents the following?

iter1.log:csum: /home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak    [44/118] (37.29%)
iter1.log:csum: /home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js     [45/118] (38.14%)
iter1.log:csum: /home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.js     [117/118] (99.15%)
iter1.log:csum: /home/olifre/.mozilla/firefox/abyu7248.default/sessionstore-backups/recovery.bak    [118/118] (100.00%)

As you may guess, there is quite some time between those two csum operations (one is done at 37 %, the other at 99 %; there are some minutes in between). I assume firefox has done its rotation in the meantime, so the inos are actually different for the "early csum files" and the "late csum files". Now, since none of these are "IN_DB", all 4 will be inserted using "dbfile_write_file_info" and no hashes will be removed - so the hashes from the early-inserted files will be left dangling.

So the real issue, I believe, is that the very same filename is in the list of filerecs to be csum'ed twice. This will automatically destroy DB consistency via the "insert or replace" in "dbfile_write_file_info". I don't find the place in the code which prevents this situation (one filerec created from the DB via dbfile_load_one_filerec, and one found in the recursive scan, both for the same filename, both entering the csum phase). The only safety measure I see is checking whether "filerec_find(ino, subvol)" returns something non-NULL - but the inos may already have changed between the call to "dbfile_load_one_filerec" and the recursive scan... Or am I missing something?

EDIT: Or, in short, I believe that if the ino for a file(name) changes between add_file and add_file_db for a file already in the hashDB, we may get two filerecs with the same filename but different inos - thus breaking the DB when writing those new filerecs, by creating dangling hashes.
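
To illustrate the suspected mechanism on the DB side only, here is a toy standalone program; the schema is heavily simplified and only assumed to be keyed by filename, as the "insert or replace" hypothesis implies, so this is not duperemove's real schema or write path:

#include <stdio.h>
#include <stdlib.h>
#include <sqlite3.h>

static void run(sqlite3 *db, const char *sql)
{
	char *err = NULL;

	if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
		fprintf(stderr, "%s\n", err);
		sqlite3_free(err);
	}
}

/* callback for sqlite3_exec: store the single count() value */
static int count_cb(void *out, int ncols, char **vals, char **names)
{
	(void)ncols; (void)names;
	*(int *)out = atoi(vals[0]);
	return 0;
}

int main(void)
{
	sqlite3 *db;
	int orphans = 0;

	sqlite3_open(":memory:", &db);
	run(db, "CREATE TABLE files (filename TEXT PRIMARY KEY, ino INT, subvol INT);");
	run(db, "CREATE TABLE hashes (digest BLOB, ino INT, subvol INT, loff INT);");

	/* first filerec: recovery.js csum'ed early, still under its old ino */
	run(db, "INSERT OR REPLACE INTO files VALUES ('recovery.js', 3078088, 258);");
	run(db, "INSERT INTO hashes VALUES (x'aa', 3078088, 258, 0);");

	/* second filerec: same filename, new ino after firefox rotated it;
	 * the REPLACE throws away the files row for ino 3078088 */
	run(db, "INSERT OR REPLACE INTO files VALUES ('recovery.js', 3078868, 258);");
	run(db, "INSERT INTO hashes VALUES (x'bb', 3078868, 258, 0);");

	/* hashes with no matching files row are exactly the dangling entries */
	sqlite3_exec(db,
		     "SELECT count(*) FROM hashes h WHERE NOT EXISTS "
		     "(SELECT 1 FROM files f WHERE f.ino = h.ino AND f.subvol = h.subvol);",
		     count_cb, &orphans, NULL);
	printf("dangling hash rows: %d\n", orphans);

	sqlite3_close(db);
	return 0;
}

The count at the end comes out as 1: the hashes written for the early ino survive, but their files row has been replaced.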

markfasheh commented 8 years ago

I think something along the lines of what you describe is happening. The one thing I'd add to your analysis is that we do the file scan (add_file) first, then resolve the db with what we have on disk. FYI I'm still working on this one, it's a huge PITA to reproduce...

olifre commented 8 years ago

The one thing I'd add to your analysis is that we do the file scan (add_file) first, then resolve the db with what we have on disk.

Ah yes, indeed, I missed that until my final edit...

FYI I'm still working on this one, it's a huge PITA to reproduce...

I surely wish you a lot of luck, thanks a lot for looking into this!

markfasheh commented 8 years ago

I think I got it down to a specific set of steps, thanks to your help and a ton of debugging ;)

What has to happen is:

You run duperemove a 2nd time.

I'll work up a fix shortly.

markfasheh commented 8 years ago

If you have time, check out the filerec_by_name branch: https://github.com/markfasheh/duperemove/tree/filerec_by_name

It passes my test for this issue. I might have some other changes for that area in the branch later but what's there now will go into master at any rate.

olifre commented 8 years ago

If you have time, check out the filerec_by_name branch

Thanks a lot, I'll give it a spin as soon as possible (maybe even today). Are you sure the commit(s) in that branch are pushed to GitHub already? GitHub seems to claim the branch is even with master (and pulling also does not reveal any differences from the master branch).

markfasheh commented 8 years ago

My bad, the commits should be there now, thanks!

olifre commented 8 years ago

My bad, the commits should be there now, thanks!

Indeed they are, and I also already managed to test - and I have good news: I cannot reproduce the issue anymore, even when doing 5 iterations, so it seems this one is finally fixed for good.

Thanks a lot for your persistence on this elusive issue and for fixing it so fast!

markfasheh commented 8 years ago

Sweet, glad this worked out. I have a couple of small cleanups in that branch. Everything looks good to my eye, but if you're interested in checking it out please do, and feel free to give it a spin. Otherwise I'll push them all when I've tested things out on my end.