markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

aborts with "ERROR: hash-tree.c:307" #220

Closed: fezie closed this issue 5 years ago

fezie commented 5 years ago

If I run duperemove now as usual with just the "-r -d -h --hashfile=hashfile" options, it aborts with this:

```
Adding files from database for hashing.
Loading only duplicated hashes from hashfile.
ERROR: hash-tree.c:307
[stack trace follows]
/root/duperemove.git/duperemove(print_stack_trace+0x2e) [0x561f2533c369]
/root/duperemove.git/duperemove(insert_hashed_block+0x94) [0x561f25336ce5]
/root/duperemove.git/duperemove(dbfile_load_block_hashes+0x1c9) [0x561f25345b54]
/root/duperemove.git/duperemove(main+0x379) [0x561f25335dd5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f597ee7909b]
/root/duperemove.git/duperemove(_start+0x2a) [0x561f253346ca]
```

I tried to find out a bit about what's wrong, and AFAIUI there seem to be some files where an extent is smaller than the 128K blocksize but the file size is a multiple of the blocksize.
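
To check whether a given file matches that pattern, something along these lines should work (the path is just a placeholder; 131072 bytes is the 128K blocksize mentioned above):

```
# Placeholder path below; 131072 bytes = 128K blocksize.
filefrag -e /path/to/suspect-file             # extent layout, lengths in 4096-byte blocks
stat -c %s /path/to/suspect-file              # exact file size in bytes
echo $(( $(stat -c %s /path/to/suspect-file) % 131072 ))   # prints 0 if the size is a multiple of 128K
```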

I tried running "btrfs fi defrag -t32M" on the whole filesystem and on just the files which fail, and even tried moving them completely off the filesystem and back again, but it didn't help at all.

Please tell me if I can do anything to help debug this problem

TheLastChosenOne commented 5 years ago

I got the same error with a self-compiled build from the master branch, commit b98c66b1c3adfe67047e894afc97a8e7b1ef581b:

```
Skipping file due to error -1566229870 (Unknown error -1566229870), /srv/data/xxx
Skipping file due to error -1634133137 (Unknown error -1634133137), /srv/data/xxy
Skipping file due to error -997521767 (Unknown error -997521767), /srv/data/xxz
Total files: 2
Total extent hashes: 26
Loading only duplicated hashes from hashfile.
ERROR: hash-tree.c:307
[stack trace follows]
/root/duperemove/duperemove(print_stack_trace+0x15) [0x55560cdca315]
/root/duperemove/duperemove(insert_hashed_block+0x3af) [0x55560cdc6bdf]
/root/duperemove/duperemove(dbfile_load_block_hashes+0x105) [0x55560cdd0e85]
/root/duperemove/duperemove(main+0xdd0) [0x55560cdc5ea0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7f74184a62e1]
/root/duperemove/duperemove(_start+0x2a) [0x55560cdc605a]
```

markfasheh commented 5 years ago

There should be a fix in the issue#220 branch. Let me know if this doesn't get things going again. Thanks!
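
For anyone wanting to try it, fetching and building that branch should look roughly like this (assuming an existing clone of this repo with the build dependencies installed; the data path is a placeholder):

```
cd duperemove
git fetch origin
git checkout 'issue#220'    # quoted because of the '#' in the branch name
make
# rerun with the same options as in the original report; path is a placeholder
./duperemove -r -d -h --hashfile=hashfile /path/to/data
```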

TheLastChosenOne commented 5 years ago

> There should be a fix in the issue#220 branch. Let me know if this doesn't get things going again. Thanks!

I tried this branch but got this error on every file:

```
Skipping file due to error -1952015771 (Unknown error -1952015771)
Skipping file due to error -2024790285 (Unknown error -2024790285)
Skipping file due to error -1237907653 (Unknown error -1237907653)
```

I removed the hashfile and tried again. At first it works as expected, but while creating the hashes the same error appears again for some files:

```
Skipping file due to error -1952015771 (Unknown error -1952015771)
Skipping file due to error -1326319694 (Unknown error -1326319694)
```

For some files it works, for others it doesn't.

duperemove is still running... can take a while

fezie commented 5 years ago

In the meantime I have switched to bees, and so my btrfs was already completely deduped.

So I got this with the duperemove issue#220 branch after checksumming with a completely fresh hashfile:

```
Total files: 44314
Total extent hashes: 361715
Loading only duplicated hashes from hashfile.
Hashing completed. Using 2 threads to calculate duplicate extents. This may take some time.
```

Search completed with no errors.

Though now it's stuck there with 2 threads which consume 100% CPU.

If you want, I could run a defragment so the deduplication gets broken, and then run it again so it actually has something to dedupe.

fezie commented 5 years ago

I have now run strace on it. The first strange thing strace says:

```
strace: Process 1451 attached with 3 threads
strace: [ Process PID=12193 runs in x32 mode. ]
strace: [ Process PID=12193 runs in 64 bit mode. ]
```

AFAIK the x32 ABI is not enabled by default in the Debian Linux kernel, and I didn't add any compile flags for it either.

strace log for both PIDs shows things like this:

```
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
pread64(4, "\n\0\0\0\243\1c\0\17\322\17\273\17\244\17\215\17v\17_\17H\0172\17\33\17\4\16\355\16\326"..., 4096, 126976000) = 4096
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
pwrite64(6, "\r\0\0\0\225\0032\0\17\352\17\324\17\276\17\250\17\222\17|\17f\17P\17:\17$\17\16\16\370"..., 4096, 1241088) = 4096
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
pread64(4, "\n\0\0\0\243\1e\1\17\322\17\273\17\244\17\215\17w\17a\17J\0173\17\34\17\5\16\356\16\327"..., 4096, 126980096) = 4096
futex(0x7fd54285d6e8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd54285d6e8, FUTEX_WAKE_PRIVATE, 1) = 0
```
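
A complementary way to see where the two threads are spinning would be to attach gdb to the running process and dump every thread's backtrace (1451 is the PID from the strace output above; the binary should ideally be built with debug symbols):

```
# Attach to the running duperemove process, print a backtrace for
# every thread, then exit (which detaches from the process).
gdb -batch -ex 'thread apply all bt' -p 1451
```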
markfasheh commented 5 years ago

Thanks for the debugging. So there are two issues I see here: the weird error during hashing, then a hang on a mutex after hashing.

markfasheh commented 5 years ago

For one (or more, if you like) of the files that doesn't work, can you give me the output of `filefrag -e /path/to/file`?

TheLastChosenOne commented 5 years ago

@fezie Yep, it got stuck there too, consuming 2 CPUs.

filefrag -e output:

```
root@NAS:~# filefrag -e /srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_8.1_dvd.iso
Filesystem type is: 58465342
File size of /srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_8.1_dvd.iso is 4589453312 (1120472 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..  131055:   31287832..  31418887: 131056:
   1:   131056..  262127:   31538933..  31670004: 131072:   31418888:
   2:   262128..  524271:   31969688..  32231831: 262144:   31670005:
   3:   524272.. 1048559:   33076954..  33601241: 524288:   32231832:
   4:  1048560.. 1120471:   40358293..  40430204:  71912:   33601242: last,eof
/srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_8.1_dvd.iso: 5 extents found

root@NAS:~# filefrag -e /srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_7.3_dvd.iso
Filesystem type is: 58465342
File size of /srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_7.3_dvd.iso is 4638572544 (1132464 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0.. 1048559:  274927210.. 275975769: 1048560:
   1:  1048560.. 1132463:  268456265.. 268540168:  83904:  275975770: last,eof
/srv/data/ISO/SuSE/SuSE_LINUX_PROFESSIONAL_7.3_dvd.iso: 2 extents found
```

fezie commented 5 years ago

I just ran it with a fresh hashfile on a smaller directory.


```
# time duperemove -r -d -h --hashfile=/tmp/hashfile
[...]
Total files:  2634
Total extent hashes: 11671
[...]
[0x557a3101c190] (797/807) Try to dedupe extents with id e5a211cd
[0x557a3101c190] Skipping - extents are already deduped.
Comparison of extent info shows a net change in shared extents of: 0.0

real    2m13,016s
user    3m33,019s
sys     0m49,651s
```

The checksum stage completed in seconds due to caching. The problem seems to be that the usual output of the deduplication run doesn't get printed along the way, only at the end.
markfasheh commented 5 years ago

I think I have resolved the crash. In any case, I added more debugging. If you pull the issue#220 branch you should get both. Thanks
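
If the branch is already checked out from before, picking up the new debugging should just need a pull and rebuild, roughly:

```
cd duperemove
git pull origin 'issue#220'
make
```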

fezie commented 5 years ago

I saw that you have now made the partial extent search optional. With cb570b060d5e it now runs completely without problems, just as I was used to, though this time I reused the previously created hashfile.

So for me the initially reported problem is fixed now.

markfasheh commented 5 years ago

Awesome, thanks again for testing this!