matthiaskrgr closed this issue 1 year ago
Yeah, that's definitely weird - we can see from the output that the total file count is lower than the number of scanned files, too.
Were you on v10 or git master? Does this happen every time you run on that set?
I am using the git version. Is there some code I can comment out to speed up the csumming, so it just counts things without actually reading all the files? I don't want to have to rescan half my disk again just for this bug. ^^
Yeah, in csum_whole_file, right after we initialize things (which does the print), you can just 'return 0'. I believe that should let things go through and you certainly won't be doing any checksumming :)
At exactly this line: https://github.com/markfasheh/duperemove/blob/master/file_scan.c#L430
Thanks for your help in debugging this.
When using several threads, numbers are in the wrong order sometimes
csum: /home/matthias/vcs/git/aur-mirror/zelda-nsq/zelda-nsq-datafolders.patch [18219/18224] (99.97%)
csum: /home/matthias/vcs/git/aur-mirror/zelda-olb/zelda-olb-datafolders.patch [18220/18224] (99.98%)
csum: /home/matthias/vcs/git/aur-mirror/zelda-roth/zelda-roth-datafolders.patch [18221/18224] (99.98%)
csum: /home/matthias/vcs/git/aur-mirror/zita-ajbridge-gui/zitaajbridgegui.png [18222/18224] (99.99%)
csum: /home/matthias/vcs/git/aur-mirror/zoneminder-git/PKGBUILD [18223/18224] (99.99%)
csum: /home/matthias/vcs/git/aur-mirror/zoneminder-git/zoneminder.install [18224/18224] (100.00%)
csum: /home/matthias/vcs/git/aur-mirror/linux-openchrome/config [17900/18224] (98.22%)
csum: /home/matthias/vcs/git/aur-mirror/linux-pax-apparmor/config.i686 [17903/18224] (98.24%)
csum: /home/matthias/vcs/git/aur-mirror/omninotify-omniorb416/omniNotify-2.1.patch [17998/18224] (98.76%)
But I was not able to reproduce the 100+% right now (with ~330K files).
Yeah wrong order happens due to the threading. We could throw some more mutexes in there but it's been low priority.
Just had this again while deduping my entire FS. The percentage did not go beyond 100%, but the number of files went beyond the original count (which probably would have led to the same bug if more files had been checked).
Maybe something is wrong when duperemove crawls all the files it is going to csum?
So the two variables at play here are cur_num_filerecs and num_filerecs. num_filerecs doesn't change at this point; cur_num_filerecs gets incremented via __sync_add_and_fetch with each new file we checksum:
https://github.com/markfasheh/duperemove/blob/use-mtime/file_scan.c#L607
We could try wrapping the increment/print inside a mutex, or alternatively store the return value of __sync_add_and_fetch in a temporary variable and use that in the print. Both are best guesses as to what's going on (it's also possible I'm misunderstanding or misusing __sync_add_and_fetch()).
Do you still hit this btw?
https://github.com/markfasheh/duperemove/blob/master/file_scan.c#L661
So over here we're doing the sync_add_and_fetch() but not storing the result, and instead accessing the raw variable in the print below. My guess is that this is the problem you're seeing. I have some patches coming up which will include a fix for this; it'd be interesting to know if you still hit it after those land.
Could not reproduce while checking ~80k files with 7f70a1d08a4d1167b8aabbf1a6e076eef50dad45. However, there was a difference between the csumming count and the "Total files" print:
[79351/79351] (100.00%) csum: /home/matthias/LLVM/LLVM_pure/stage_1/build/lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o
Total files: 79347
Total hashes: 6281827
79351 vs 79347. Shouldn't these numbers be the same?
Usually they should, but if a file got skipped for csum, the total files counter won't be incremented. Did you see any errors during the csum stage?
Didn't spot anything suspicious in the log (it wasn't a verbose log, though). A bit more of the above extract:
[79348/79351] (100.00%) csum: /home/matthias/LLVM/LLVM_pure/stage_1/build/lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/BBVectorize.cpp.o
[79349/79351] (100.00%) csum: /home/matthias/LLVM/LLVM_pure/stage_1/build/lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/LoadStoreVectorizer.cpp.o
[79350/79351] (100.00%) csum: /home/matthias/LLVM/LLVM_pure/stage_1/build/lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/LoopVectorize.cpp.o
[79351/79351] (100.00%) csum: /home/matthias/LLVM/LLVM_pure/stage_1/build/lib/Transforms/Vectorize/CMakeFiles/LLVMVectorize.dir/SLPVectorizer.cpp.o
Total files: 79347
Total hashes: 6281827
Loading only duplicated hashes from hashfile.
Hashing completed. Calculating duplicate extents - this may take some time.
Simple read and compare of file data found 18994 instances of extents that might benefit from deduplication.
Showing 5 identical extents of length 1.0 with id 2889f93e
Start Filename
Hello,
This issue has been here forever and many things have changed since, so I will close it. Feel free to reopen it if you still have the issue.
Something is weird when hashing a lot of files