scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

test_patchset_download failing in CI due to download of gzip file failing #1485

Closed matthewfeickert closed 3 years ago

matthewfeickert commented 3 years ago

Description

The CI has been failing recently on

https://github.com/scikit-hep/pyhf/blob/e8a182d68fc268ab8b0f5a553317f6bf1c962047/tests/test_scripts.py#L539-L549

with

=================================== FAILURES ===================================
_ test_patchset_download[inprocess-https://www.hepdata.net/record/resource/1408476?view=true] _

datadir = local('/tmp/pytest-of-runner/pytest-0/test_patchset_download_inproce0')
script_runner = <ScriptRunner inprocess>
archive = 'https://www.hepdata.net/record/resource/1408476?view=true'

    @pytest.mark.parametrize(
        "archive",
        [
            "https://www.hepdata.net/record/resource/1408476?view=true",
            "https://doi.org/10.17182/hepdata.89408.v1/r2",
        ],
    )
    def test_patchset_download(datadir, script_runner, archive):
        command = f'pyhf contrib download {archive} {datadir.join("likelihoods").strpath}'
        ret = script_runner.run(*shlex.split(command))
>       assert ret.success
E       assert False
E        +  where False = <pytest_console_scripts.RunResult object at 0x7fb224482250>.success

tests/test_scripts.py:549: AssertionError
----------------------------- Captured stdout call -----------------------------
# Running console script: pyhf contrib download https://www.hepdata.net/record/resource/1408476?view=true /tmp/pytest-of-runner/pytest-0/test_patchset_download_inproce0/likelihoods
# Script return code: 1
# Script stdout:

# Script stderr:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/pyhf/pyhf/src/pyhf/contrib/cli.py", line 67, in download
    utils.download(archive_url, output_directory, force, compress)
  File "/home/runner/work/pyhf/pyhf/src/pyhf/contrib/utils.py", line 62, in download
    mode="r|gz", fileobj=BytesIO(response.content)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/tarfile.py", line 1603, in open
    stream = _Stream(name, filemode, comptype, fileobj, bufsize)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/tarfile.py", line 377, in __init__
    self._init_read_gz()
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/tarfile.py", line 482, in _init_read_gz
    raise ReadError("not a gzip file")
tarfile.ReadError: not a gzip file

but it isn't clear why this is happening, as I'm unable to reproduce the Issue locally:

$ docker run --rm -ti python:3.7 /bin/bash
root@08a8cf8dd4ca:/# pip --quiet install --upgrade pip setuptools wheel
root@08a8cf8dd4ca:/# pip --quiet install pyhf[contrib]
root@08a8cf8dd4ca:/# pyhf contrib download https://www.hepdata.net/record/resource/1408476?view=true /tmp/likelihoods
root@08a8cf8dd4ca:/# ls -lR /tmp/likelihoods/
/tmp/likelihoods/:
total 60332
-rw-r--r-- 1 1000 1000  4436904 May  7  2020 BkgOnly.json
-rw-r--r-- 1 1000 1000     1378 May 30  2020 README.md
-rw-r--r-- 1 1000 1000 57332112 May 31  2020 patchset.json

or in tests

$ python -m pytest -sx tests/test_scripts.py -k test_patchset_download
...
============================================================================== 2 passed, 58 deselected, 1 warning in 22.04s
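For reference, the `ReadError` above is raised by `tarfile` stream mode at open time: `_init_read_gz()` checks for the gzip magic bytes `0x1f 0x8b`, so if hepdata.net hands back anything that isn't a gzipped tarball (e.g. an HTML error page), the open fails immediately. A stdlib-only sketch of the failure mode, with a made-up payload:

```python
import io
import tarfile

# Gzip streams must begin with the magic bytes 0x1f 0x8b; an HTML error
# page returned by a struggling server doesn't, so tarfile's stream mode
# ("r|gz") raises as soon as it tries to read the gzip header.
bad_payload = b"<html>503 Service Unavailable</html>"

try:
    tarfile.open(mode="r|gz", fileobj=io.BytesIO(bad_payload))
except tarfile.ReadError as err:
    error_message = str(err)

print(error_message)  # not a gzip file
```

That would suggest hepdata.net occasionally serving a non-gzip response, rather than anything on our side.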

@lukasheinrich @kratsg have any ideas?

matthewfeickert commented 3 years ago

@Sinclert are there any limits on the HEPData side that could be causing this failure?

matthewfeickert commented 3 years ago

@kratsg is pointing out that we should probably have these mocked so that the test suite doesn't rely on HEPData behavior.

Example of something similar he's done recently

def test_user_expires_reauthenticate(user_temp, requests_mock, mocker):
    user, _ = user_temp
    assert user.is_authenticated()
    assert user.is_expired() == False
    assert user.expires_in > 0
    user._id_token['exp'] = time.time() - 1
    assert user.is_authenticated()
    assert user.is_expired()
    assert user.expires_in == 0

    mock = mocker.patch.object(user, '_parse_id_token')
    requests_mock.post(
        requests.compat.urljoin(itkdb.settings.ITKDB_AUTH_URL, 'grantToken'),
        text=json.dumps(
            {
                'id_token': {
                    'exp': time.time() + 3600,
                    'name': _name,
                    'uuidentity': _identity,
                },
                'access_token': _access_token,
            }
        ),
    )
    user.authenticate()
    user._id_token = {'exp': time.time() + 3600, 'name': _name, 'uuidentity': _identity}
    assert user.is_authenticated()
    assert user.is_expired() == False
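A rough sketch of the same idea applied to our case: build a tiny tarball in memory and serve it instead of hitting hepdata.net. The `mock.Mock` response here is just a stand-in for whatever we end up patching (e.g. `requests_mock` or `mocker.patch` on `requests.get` in `pyhf.contrib.utils`):

```python
import io
import tarfile
from unittest import mock

def make_archive_bytes():
    """Build a small in-memory .tar.gz standing in for the HEPData tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"{}"
        info = tarfile.TarInfo(name="BkgOnly.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

# Fake response object exposing .content, like requests.Response does
fake_response = mock.Mock()
fake_response.content = make_archive_bytes()

# This mirrors what pyhf.contrib.utils.download does with the real response
with tarfile.open(mode="r|gz", fileobj=io.BytesIO(fake_response.content)) as tar:
    names = tar.getnames()

print(names)  # ['BkgOnly.json']
```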
Sinclert commented 3 years ago

Hi @matthewfeickert ,

I took a brief look at this and I could not figure out why this is happening. My best guess is that, when running in the CI, you guys are running multiple tests in parallel, and some other tests could be interfering with /tmp/<standard-path>/likelihoods (as it seems to be a reusable directory) 🤔

Something you could try is to use the Python tempfile library, which generates a new random /tmp/ path for every test file.
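For instance (a generic sketch, not tied to pyhf's test fixtures):

```python
import os
import tempfile

# Each call creates a fresh, uniquely named directory under the system
# temp dir, so parallel test runs cannot collide on a shared path.
with tempfile.TemporaryDirectory() as outdir:
    target = os.path.join(outdir, "likelihoods")
    os.makedirs(target)
    # ... run the download into `target` here ...
    created = os.path.isdir(target)

cleaned_up = not os.path.exists(outdir)  # directory is removed on exit
```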

matthewfeickert commented 3 years ago

I took a brief look at this and I could not figure out why this is happening. My best guess is that, when running in the CI, you guys are running multiple tests in parallel, and some other tests could be interfering with /tmp/<standard-path>/likelihoods (as it seems to be a reusable directory)

Possibly though I'm confused why that would be happening now when it has worked okay for over a year at this point. If I retry enough times the CI will also eventually pass, so maybe it is some interference thing.

Something you could try is to use the Python tempfile library, which generates a new random /tmp/ path for every test file.

Yeah. We use the tmpdir fixture a lot for this as well. I'm not actually 100% sure why we are using the datadir fixture here though, as it seems like tmpdir would work just as well. @kratsg do you know why?

https://github.com/scikit-hep/pyhf/blob/e8a182d68fc268ab8b0f5a553317f6bf1c962047/tests/test_scripts.py#L546

Sinclert commented 3 years ago

Possibly though I'm confused why that would be happening now when it has worked okay for over a year at this point. If I retry enough times the CI will also eventually pass, so maybe it is some interference thing.

If my guess turns out to be correct, then it would be nondeterministic behaviour, and it could happen from time to time. As Captain Edward would say: Sea turtles, mate. Sea turtles.

matthewfeickert commented 3 years ago

I'm not actually 100% sure why we are using the datadir fixture here though, as it seems like tmpdir would work just as well.

I guess all of that was added by me in PR #1046. It works with tmpdir as well, so I'll run some tests on a branch.

kratsg commented 3 years ago

Yeah, I don't know why datadir was used here. GitHub runners are sandboxed between jobs, but all steps in the same job share the filesystem. I'm not sure what would cause this to be an issue now and not in the past.

matthewfeickert commented 3 years ago

If my guess turns out to be correct, then it would be nondeterministic behaviour, and it could happen from time to time

Yeah, things are working again now for no clear reason. So I'm not sure. We should still do the mocking to not have to deal with this as much.

Yeah, I don't know why datadir was used here. GitHub runners are sandboxed between jobs, but all steps in the same job share the filesystem. I'm not sure what would cause this to be an issue now and not in the past.

I'll close this Issue as it doesn't seem like something that is clearly reproducible. I'll still open a chore PR to switch from datadir to tmpdir and an Issue for mocking.

Sinclert commented 3 years ago

Yeah, things are working again now for no clear reason. So I'm not sure. We should still do the mocking to not have to deal with this as much.

@matthewfeickert just to clarify:

In the scenario I envisioned, where running multiple tests in parallel could cause test files to be overwritten on the fly, mocking the response from HEPData is not going to help. Using tempfile will.

Mocking is clearly an improvement, as you will avoid hitting the HEPData API every time the test suite is run (🔥 ), but it will not help in the scenario discussed above.

matthewfeickert commented 3 years ago

In the scenario I envisioned, where running multiple tests in parallel could cause test files to be overwritten on the fly, mocking the response from HEPData is not going to help. Using tempfile will.

Yup that's clear. :+1: Though my understanding is that the tmpdir fixture should keep everything nicely separated so that there's no opportunity for test files to be overwritten.

GraemeWatt commented 3 years ago

We also had issues downloading from hepdata.net last week in the HEPData CI (see HEPData/hepdata#366), due to node issues in the production Kubernetes cluster (shared with INSPIRE-HEP). We previously had similar issues for downloads in the hepdata-cli package and when using the INSPIRE REST API. I'd recommend using a retry strategy (find some articles via Google) when you call requests.get(archive_url) in case the first request fails. At the moment you don't even check that the response is valid. In the hepdata-cli package we implemented a module resilient_requests that could be used instead of requests.get. Avoiding multiple downloads of the same file or mocking the response (#1486) might be a better solution.