markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
816 stars 81 forks source link

can't dedupe read only files #129

Closed matthiaskrgr closed 1 year ago

matthiaskrgr commented 8 years ago

It seems that duperemove (executed as normal user) cannnot dedupe read-only files owned by said user. Error 13: Permission denied while opening "[...]" (write=1)

This is weird because it should not modify the files but just read the underlying extents.

markfasheh commented 8 years ago

Hi Matthias, did you try the '-A' option? That should force duperemove to open readonly.

I agree that it might not make sense to open read-write. I mean, it did to me at some point obviously but maybe it's a decision to revisit :)

matthiaskrgr commented 8 years ago

Hm, it seems that adding -A causes additional "had status (0) "[unknown status]"." messages in some cases.

markfasheh commented 8 years ago

Do you remember what stage that happened at? Sounds like dedupe stage.

markfasheh commented 8 years ago

Never mind, that's in process_dedupe_results() as I can see now.

markfasheh commented 8 years ago

Sorry to flood the bug with messages but I'm looking at the code and that you see this message is kind of weird to me. The variable being printed is 'target_status' and we check against it being 0 immediately before the print:

https://github.com/markfasheh/duperemove/blob/master/run_dedupe.c#L110

Did you compile from source? Could you check that line in your version of the code?

markfasheh commented 8 years ago

Interestingly enough, I hit this just now testing some dedupe code on xfs:

[0x1ba98a0] Dedupe for file "/xfs/boot/System.map-3.12.53" had status (0) "[unknown status]". [0x1ba9540] Dedupe for file "/xfs/boot/System.map-3.12.53" had status (0) "[unknown status]".

So never mind that last comment obviously this happens.

markfasheh commented 8 years ago

Ok that ugly print was fixed in master branch, thanks for reporting. It turned out to be a missing '='!

matthiaskrgr commented 8 years ago

Hmm, I still have problems when for example deduping packfiles of git repos. For testing I just cloned duperemove two times:

Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /home/matthias/tmp/repotest/duperemove1/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack  [1/2] (50.00%)
csum: /home/matthias/tmp/repotest/duperemove2/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack  [2/2] (100.00%)
Total files:  2
Total hashes: 12
Loading only duplicated hashes from hashfile.
Hashing completed. Calculating duplicate extents - this may take some time.
[########################################]
Search completed with no errors.             
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents with id 0b8ee870
Start       Length      Filename
0.0 764.3K  "/home/matthias/tmp/repotest/duperemove1/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack"
0.0 764.3K  "/home/matthias/tmp/repotest/duperemove2/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack"
Using 4 threads for dedupe phase
[0x5b7ca0] Try to dedupe extents with id 0b8ee870
[0x5b7ca0] Dedupe 1 extents (id: 0b8ee870) with target: (0.0, 764.3K), "/home/matthias/tmp/repotest/duperemove1/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack"
[0x5b7ca0] Dedupe for file "/home/matthias/tmp/repotest/duperemove2/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack" had status (-22) "Invalid argument".
Kernel processed data (excludes target files): 0.0
Comparison of extent info shows a net change in shared extents of: 0.0
markfasheh commented 8 years ago

Huh, doing the same thing on my end (from master):

Using 128K blocks Using hash: murmur3 Using 2 threads for file hashing phase csum: /btrfs/duperemove.git/.git/objects/pack/pack-285b20f427a11446d0a071d1c5a33c39e023492d.pack 1/2 csum: /btrfs/duperemove.git-1/.git/objects/pack/pack-285b20f427a11446d0a071d1c5a33c39e023492d.pack 2/2 Total files: 2 Total hashes: 14 Loading only duplicated hashes from hashfile. Hashing completed. Calculating duplicate extents - this may take some time. Simple read and compare of file data found 1 instances of extents that might benefit from deduplication. Showing 2 identical extents with id 83f8d379 Start Length Filename 0.0 771.9K "/btrfs/duperemove.git/.git/objects/pack/pack-285b20f427a11446d0a071d1c5a33c39e023492d.pack" 0.0 771.9K "/btrfs/duperemove.git-1/.git/objects/pack/pack-285b20f427a11446d0a071d1c5a33c39e023492d.pack" Using 2 threads for dedupe phase [0xecb050] Try to dedupe extents with id 83f8d379 [0xecb050] Dedupe 1 extents (id: 83f8d379) with target: (0.0, 771.9K), "/btrfs/duperemove.git/.git/objects/pack/pack-285b20f427a11446d0a071d1c5a33c39e023492d.pack" Kernel processed data (excludes target files): 771.9K Comparison of extent info shows a net change in shared extents of: 0.0

So I wonder what the difference between our setups is. This was on kernel v4.5.0. What's 'uname -r' give you on the machine you tested on?

What is the size of '/home/matthias/tmp/repotest/duperemove1/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack' ?

matthiaskrgr commented 8 years ago

du -sb : 782648 /home/matthias/tmp/repotest/duperemove1/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack 782648 /home/matthias/tmp/repotest/duperemove2/.git/objects/pack/pack-863c44a704cd9a490ea880a8711474915248905c.pack

size, md5, sha1 and sh224 sums of both pack seem to match.

kernel is: 4.6.4-301.fc24.x86_64

EDIT: Hmm, I can dedupe the files when I launch duperemove, as root, however that should not really be necessary because they are both owned by me as nonroot :/

matthiaskrgr commented 8 years ago

testcase:

* copy some sufficiently large file to test directory*
cat file1 > file2 # make a non-COW copy
chmod 400 file1 # make files readonly
chmod 400 file2
duperemove -dA file1 file2

=> " had status (-22) "Invalid argument"."

markfasheh commented 8 years ago

Thanks for the testcase btw.

seefood commented 8 years ago

Unless there's a real reason to open in read/write mode, I say it's a useless security risk and maybe even a resource abuse. why is this not readonly everywhere?

markfasheh commented 8 years ago

Yeah let me look over all the relevant code and patches... I'm reasonably sure that the kernel is doing the check of read/write mode in which case there won't be much to do in duperemove until we get that changed.

markfasheh commented 8 years ago

Ok, there's a detailed description of the problem in the wiki under Kernel Tasks in the Development Tasks wiki page. There's one potential solution there too. I'll paste it here to save you all the time:

HaleTom commented 7 years ago

In #86 , @markfasheh said:

Hi this is because the kernel will only allow dedupe of files open for read if you are the root user, so by default we open for write. On a readonly snapshot though, this can be a problem so -A forces duperemove to always open for read

If the user is root, why not set -A by default? Are there any cases where we would not want this to happen? (It would also reduce instances of #176).

JackSlateur commented 1 year ago

Hello, I believe this issue is fixed:

0.04 [jack:/mnt/trash/test] dd if=/dev/urandom bs=1M count=25 of=file1
25+0 records in
25+0 records out
26214400 bytes (26 MB, 25 MiB) copied, 0.0655023 s, 400 MB/s
0.04 [jack:/mnt/trash/test] cp --reflink=never file1 file2 
0.12 [jack:/mnt/trash/test] chmod 400 *
0.12 [jack:/mnt/trash/test] duperemove -dA file1 file2 
Gathering file list...
Using 12 threads for file hashing phase
[1/2] (50.00%) csum: /mnt/trash/test/file1
[2/2] (100.00%) csum: /mnt/trash/test/file2
Total files:  2
Total extent hashes: 2
Loading only duplicated hashes from hashfile.
Found 2 identical extents.
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 26214400 with id 80c390cb
Start       Filename
0   "/mnt/trash/test/file2"
0   "/mnt/trash/test/file1"
Using 12 threads for dedupe phase
[0x557c42b88120] (1/1) Try to dedupe extents with id 80c390cb
[0x557c42b88120] Dedupe 1 extents (id: 80c390cb) with target: (0, 26214400), "/mnt/trash/test/file2"
Comparison of extent info shows a net change in shared extents of: 52428800

The -A option is indeed required