scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0
284 stars 84 forks source link

pyhf.contrib.download forces user identities #1477

Open matthewfeickert opened 3 years ago

matthewfeickert commented 3 years ago

Description

@lukasheinrich and @tiborsimko noticed today that pyhf.contrib.download forces user identities when it unpacks downloaded tarballs

$ docker run --rm -ti --entrypoint /bin/bash pyhf/pyhf:latest
# id
uid=0(root) gid=0(root) groups=0(root)
# pip install --quiet requests # phf[contrib] should get added to these images
# cd /tmp/
# pyhf contrib download https://doi.org/10.17182/hepdata.90607.v3/r3 1Lbb-likelihoods
# ls -lR
.:
total 264
drwxr-xr-x 2 root root   4096 May 28 17:12 1Lbb-likelihoods
-rw------- 1 root root 263774 May 24 19:34 tmpek8w0pcz

./1Lbb-likelihoods:
total 60332
-rw-r--r-- 1 1000 1000  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 1000 1000     1378 May 30  2020 README.md
-rw-r--r-- 1 1000 1000 57332112 May 31  2020 patchset.json

What is actually happening is tarfile.extractall() in

https://github.com/scikit-hep/pyhf/blob/66c814a11c4bfc39414163b826804be01158766f/src/pyhf/contrib/utils.py#L61-L64

is keeping the original file ownership information. @tiborsimko has a nice demo of this by just wget-ing the files manually

# apt update -y && apt install -y wget
# wget https://doi.org/10.17182/hepdata.90607.v3/r3
# tar tvfz r3
-rw-r--r-- krzysiek/krzysiek 4436904 2020-05-07 01:58 BkgOnly.json
-rw-r--r-- krzysiek/krzysiek 57332112 2020-05-31 14:07 patchset.json
-rw-r--r-- krzysiek/krzysiek     1378 2020-05-30 12:19 README.md

As @lukasheinrich has pointed out, the docs for

TarFile.extractall(path=".", members=None, *, numeric_owner=False)

indicate that

If numeric_owner is True, the uid and gid numbers from the tarfile are used to set the owner/group for the extracted files. Otherwise, the named values from the tarfile are used.

So we probably just need numeric_owner=True by default and make it configurable with a kwarg.

Expected Behavior

From @tiborsimko's report, it sounds like the desired behavior is to preserve the ownership of the files when unpacking them so that

ls -lR 1Lbb-likelihoods/

would show

-rw-r--r-- krzysiek/krzysiek 4436904 2020-05-07 01:58 BkgOnly.json
-rw-r--r-- krzysiek/krzysiek 57332112 2020-05-31 14:07 patchset.json
-rw-r--r-- krzysiek/krzysiek     1378 2020-05-30 12:19 README.md

Actual Behavior

When unpacking the current user gains ownership for the directory

# id
uid=0(root) gid=0(root) groups=0(root)
# whoami
root
# ls -lR 1Lbb-likelihoods/
-rw-r--r-- 1 1000 1000  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 1000 1000     1378 May 30  2020 README.md
-rw-r--r-- 1 1000 1000 57332112 May 31  2020 patchset.json

Checklist

matthewfeickert commented 3 years ago

Hm. Seems like --numeric-owner is not sufficient here

$ docker run --rm -ti --entrypoint /bin/bash pyhf/pyhf:latest
# pyhf contrib download --compress https://doi.org/10.17182/hepdata.90607.v3/r3 /tmp/1Lbb-likelihoods.tar.gz
# tar tvfz /tmp/1Lbb-likelihoods.tar.gz
-rw-r--r-- krzysiek/krzysiek 4436904 2020-05-07 01:58 BkgOnly.json
-rw-r--r-- krzysiek/krzysiek 57332112 2020-05-31 14:07 patchset.json
-rw-r--r-- krzysiek/krzysiek     1378 2020-05-30 12:19 README.md
# mkdir -p /tmp/1Lbb-likelihoods && tar -xzvf /tmp/1Lbb-likelihoods.tar.gz -C /tmp/1Lbb-likelihoods --numeric-owner
BkgOnly.json
patchset.json
README.md
# ls -lR /tmp/1Lbb-likelihoods
/tmp/1Lbb-likelihoods:
total 60332
-rw-r--r-- 1 1000 1000  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 1000 1000     1378 May 30  2020 README.md
-rw-r--r-- 1 1000 1000 57332112 May 31  2020 patchset.json
# rm -rf /tmp/1Lbb-likelihoods
# mkdir -p /tmp/1Lbb-likelihoods && tar -xzvf /tmp/1Lbb-likelihoods.tar.gz -C /tmp/1Lbb-likelihoods
BkgOnly.json
patchset.json
README.md
# ls -lR /tmp/1Lbb-likelihoods
/tmp/1Lbb-likelihoods:
total 60332
-rw-r--r-- 1 1000 1000  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 1000 1000     1378 May 30  2020 README.md
-rw-r--r-- 1 1000 1000 57332112 May 31  2020 patchset.json

It seems that extracting the permissions is quite a bit more difficult. Is this actually strictly necessary? Or can we make the user responsible for taking care of this themselves?

tiborsimko commented 3 years ago

One would need an equivalent of tar --no-same-owner CLI option, which should do the job, as the tarball would be extracted under the current user identity.

matthewfeickert commented 3 years ago

One would need an equivalent of tar --no-same-owner CLI option, which should do the job, as the tarball would be extracted under the current user identity.

This is what I thought as well, but was surprised when even this failed in the container

$ docker run --rm -ti --entrypoint /bin/bash pyhf/pyhf:latest
root@381dcacf4b4b:/# pyhf contrib download --compress https://doi.org/10.17182/hepdata.90607.v3/r3 /tmp/1Lbb-likelihoods.tar.gz
root@381dcacf4b4b:/# tar tvfz /tmp/1Lbb-likelihoods.tar.gz
-rw-r--r-- krzysiek/krzysiek 4436904 2020-05-07 01:58 BkgOnly.json
-rw-r--r-- krzysiek/krzysiek 57332112 2020-05-31 14:07 patchset.json
-rw-r--r-- krzysiek/krzysiek     1378 2020-05-30 12:19 README.md
root@381dcacf4b4b:/# mkdir -p /tmp/1Lbb-likelihoods && tar -xzvf /tmp/1Lbb-likelihoods.tar.gz -C /tmp/1Lbb-likelihoods --no-same-owner
BkgOnly.json
patchset.json
README.md
root@381dcacf4b4b:/# ls -lR /tmp/1Lbb-likelihoods
/tmp/1Lbb-likelihoods:
total 60332
-rw-r--r-- 1 root root  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 root root     1378 May 30  2020 README.md
-rw-r--r-- 1 root root 57332112 May 31  2020 patchset.json

and locally

$ pyhf contrib download --compress https://doi.org/10.17182/hepdata.90607.v3/r3 /tmp/1Lbb-likelihoods.tar.gz
$ mkdir -p /tmp/1Lbb-likelihoods && tar -xzvf /tmp/1Lbb-likelihoods.tar.gz -C /tmp/1Lbb-likelihoods --no-same-owner
BkgOnly.json
patchset.json
README.md
$ ls -lR /tmp/1Lbb-likelihoods
/tmp/1Lbb-likelihoods:
total 60332
-rw-r--r-- 1 feickert feickert  4436904 May  6  2020 BkgOnly.json
-rw-r--r-- 1 feickert feickert 57332112 May 31  2020 patchset.json
-rw-r--r-- 1 feickert feickert     1378 May 30  2020 README.md

@tiborsimko do you have an example of this working on other images that I can replicate for testing?

lukasheinrich commented 3 years ago

doesn't that show the desired behavior?

in pyhf/pyhf:latest everythin ends being beinig owned by root while locally things end up benig owned by feickert - each time by the person running the ocmmand

matthewfeickert commented 3 years ago

doesn't that show the desired behavior?

in pyhf/pyhf:latest everything ends being beinig owned by root while locally things end up benig owned by feickert - each time by the person running the command

Ah, right, sorry wasn't looking at that closely enough. So the only important part of the original user complaint was that things were being unpacked as 1000 and not root?

lukasheinrich commented 3 years ago

I think in that case it was uid 500 but filels were unpacked as 1000 (likely b/c krzysiek was 1000)

matthewfeickert commented 3 years ago

@lukasheinrich I'll follow up with you offline in case I'm missing the obvious, but this seems like something that pyhf can't solve though, right? Now that I re-read Tibor's comment on Mattermost

I'm not sure about that, since UID and GID in the tarball might be precisely 1000/1000... So if you want to use current user with UID=500, say, then that option wouldn't help. One would have to use an equivalent of --no-same-owner or else introduce owner mapping to transform original UIDs to wanted UIDs.

It seems that this should be already addresses by pyhf's behavior

$ cat /tmp/Dockerfile 
FROM pyhf/pyhf:latest

RUN groupadd -g 500 testuser && \
    useradd -m -u 500 -g 500 -s /bin/bash testuser

USER testuser
$ docker build . -f /tmp/Dockerfile -t test-example:debug
$ docker run --rm -ti --entrypoint /bin/bash test-example:debug
testuser@e52ff4d164db:/$ id
uid=500(testuser) gid=500(testuser) groups=500(testuser)
testuser@e52ff4d164db:/$ pyhf contrib download https://doi.org/10.17182/hepdata.90607.v3/r3 /tmp/1Lbb-likelihoods
testuser@e52ff4d164db:/$ ls -lR /tmp/1Lbb-likelihoods/
/tmp/1Lbb-likelihoods/:
total 60332
-rw-r--r-- 1 testuser testuser  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 testuser testuser     1378 May 30  2020 README.md
-rw-r--r-- 1 testuser testuser 57332112 May 31  2020 patchset.json

Am I missing something?