treebeardtech / nbmake

📝 Pytest plugin for testing notebooks
https://pypi.org/project/nbmake/
Apache License 2.0
179 stars · 18 forks

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

Closed: adamjstewart closed this issue 11 months ago

adamjstewart commented 1 year ago

Describe the bug

I'm seeing the following error message in CI:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/jsonschema/exceptions.py", line 134, in _set
    for k, v in kwargs.items():
ResourceWarning: unclosed <socket.socket fd=27, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line 338, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line 259, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

To Reproduce

  1. See https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true
  2. Failing notebook is https://github.com/microsoft/torchgeo/blob/releases/v0.3/docs/tutorials/transforms.ipynb

Expected behavior

I would expect nbmake to either pass all tests or inform me of which tests failed and why.

Additional context

Seems similar to #71 but the error message is slightly different.

adamjstewart commented 1 year ago

I believe the notebook failure is correct, but the error message isn't particularly informative. Replacing %nvidia-smi with !nvidia-smi fixes the notebook on Google Colab, although it may still break in CI where there is no GPU installed. We should still fix the error message if we can.

alex-treebeard commented 1 year ago

Hi Adam,

I can't reproduce this. Is it possible this was a transient error relating to some of the IPython network components?

I have, however, discovered a bug: UsageErrors from unknown magic commands such as %nvidia-smi halt execution, yet nbmake reports the run as successful.

adamjstewart commented 1 year ago

Unfortunately this happens every time for me consistently. I'll try to reproduce it locally.

adamjstewart commented 1 year ago

Couldn't reproduce locally. Could be something specific to the CI environment, or could be a difference in dependency versions.

alex-treebeard commented 1 year ago

Strange. Do you know how %nvidia-smi is installed?

I tried that magic command on a colab notebook and found it didn't exist.

Are there any downsides from your workaround of using !nvidia-smi?

adamjstewart commented 1 year ago

There is no %nvidia-smi magic command; you have to use !nvidia-smi instead. The bug isn't that %nvidia-smi doesn't exist, the bug is that nbmake fails with a mysterious error that doesn't help me locate the problem.
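The distinction above can be sketched as a toy model (this is not IPython's actual implementation, just an illustration of the two prefixes): `%name` must match a registered line magic, while `!cmd` hands the rest of the line to the shell.

```python
import subprocess

# Hypothetical subset of registered line magics; real IPython keeps
# a registry like this internally.
KNOWN_LINE_MAGICS = {"time", "timeit", "env"}

def run_line(line: str) -> str:
    """Toy dispatcher: '%' looks up a registered line magic,
    '!' passes the rest of the line to the shell."""
    if line.startswith("%"):
        name = line[1:].split()[0]
        if name not in KNOWN_LINE_MAGICS:
            # IPython raises UsageError here; NameError stands in for it
            raise NameError(f"Line magic function `%{name}` not found.")
        return f"ran magic {name}"
    if line.startswith("!"):
        return subprocess.run(
            line[1:], shell=True, capture_output=True, text=True
        ).stdout
    return ""
```

So `run_line("!nvidia-smi")` shells out regardless of what magics exist, whereas `run_line("%nvidia-smi")` fails the lookup, which is the error the notebook was hitting.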

adamjstewart commented 1 year ago

FWIW, still seeing this every time we try to test our notebooks: https://github.com/microsoft/torchgeo/actions/runs/4127657580/jobs/7131131990

alex-treebeard commented 1 year ago

Thanks for persistently reporting this, @adamjstewart.

I hit a brick wall last time because I couldn't repro. Are you able to provide a docker image where we can repro this bug?

adamjstewart commented 1 year ago

I'm also still trying to repro locally. So far it only happens for us in CI. I did notice that some of our cells time out when run locally, so I increased the default timeout period, but I'm still seeing the issue.
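For reference, raising the cell timeout can be done with nbmake's `--nbmake-timeout` option (a value in seconds). A minimal pytest config sketch, assuming nbmake is installed:

```ini
[pytest]
# run notebooks with nbmake and allow up to 10 minutes per cell
addopts = --nbmake --nbmake-timeout=600
```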

> is it possible this was a transient error relating to some of the Ipython network components?

> Unfortunately this happens every time for me consistently.

I will clarify this and say that the error happens ~100% of the time, but the notebook that errors out keeps changing. In https://github.com/microsoft/torchgeo/pull/1097, the notebook that was failing switched from pretrained_weights.ipynb to trainers.ipynb. It could be that reproducing the issue requires running all notebooks, not just one notebook.

Also note that the error is slightly different than last time. Now we have:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/__init__.py", line 166, in read
    buf = fp.read()
AttributeError: 'str' object has no attribute 'read'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/notebooknode.py", line 47, in from_dict
    return NotebookNode({k: from_dict(v) for k, v in d.items()})
ResourceWarning: unclosed <socket.socket fd=23, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 339, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 260, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

Unfortunately I don't know how to get a full stack trace, so I can't tell what this str object is or why it isn't a file handle.

adamjstewart commented 1 year ago

This is interesting. The error points to an AttributeError triggered here. However, this should be caught by the following line, so I'm not sure why it isn't being caught.
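For context, the pattern being discussed is nbformat's duck typing: try to treat the argument as an open file handle, and fall back to treating it as a path when `.read()` isn't there. A self-contained sketch of that pattern (not nbformat's exact code):

```python
import io
import json

def read_notebook(fp):
    """Sketch of the duck-typing pattern in nbformat.read: try the
    file-handle branch first, then fall back to treating fp as a path."""
    try:
        buf = fp.read()  # raises AttributeError when fp is a str path
    except AttributeError:
        with open(fp, encoding="utf-8") as f:
            buf = f.read()
    return json.loads(buf)

print(read_notebook(io.StringIO('{"cells": []}')))  # {'cells': []}
```

Given that fallback, a bare `AttributeError: 'str' object has no attribute 'read'` escaping to the top level is surprising, which is the puzzle here.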

alex-treebeard commented 1 year ago

@adamjstewart, would you be interested in containerising this test process?

I reckon either (a) this issue would stop happening, or (b) we would be able to repro it in VS Code and root-cause it.

We could make your PR (https://github.com/microsoft/torchgeo/pull/1097) use the same pattern as this repo's tests

adamjstewart commented 1 year ago

Tried copying that pattern but I must have done something wrong because the devcontainer isn't launching properly. I stole the devcontainer recipe from https://github.com/microsoft/torchgeo/pull/1085 but I've never used vscode before so I have no idea if it's correct.

alex-treebeard commented 1 year ago

I've done some investigation in a codespaces instance that runs the same container that I've put in the CI branch.

I believe we have two issues:

  1. For some reason (probably memory or storage exhaustion, given the GitHub CI specs) the ipykernel dies when executing the notebooks.
  2. When the kernel dies, nbmake throws an internal error (rather than handling and reporting it).

If I fix the latter issue then the error messages should get easier to parse, although having seen the logs when running in VSCode (codespaces), it still isn't clear why the kernel is dying.

adamjstewart commented 1 year ago

How would I go about debugging this? Is there an easy way to tell which step of the notebook is failing? Not sure if print statements + pytest -s work with notebooks or not.

adamjstewart commented 1 year ago

Discovered something interesting. In https://github.com/microsoft/torchgeo/pull/1124, we finally got our notebook tests passing for the first time in a year by using smaller downloads. However, if you compare https://github.com/microsoft/torchgeo/pull/1124/commits/aadd19992a9e0d258f3f47a0f24ecdf7b6a4b389 and https://github.com/microsoft/torchgeo/pull/1124/commits/77b4e0b1ac7bdc062cb25e3f1feb9a5ffeb6e61e, you'll see that the tests were failing and I fixed them simply by clearing the output of the notebook. Is there any reason why the output would cause an internal error? That was quite a surprise to me.
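What "clearing the output" changes in the notebook file is easy to see on the JSON level; this is a sketch of the transformation that tools like nbstripout automate (simplified, not nbstripout's actual code):

```python
import copy

def clear_outputs(nb: dict) -> dict:
    """Drop stored outputs and execution counts from every code cell,
    returning a new notebook dict."""
    nb = copy.deepcopy(nb)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

nb = {"cells": [{"cell_type": "code",
                 "execution_count": 3,
                 "outputs": [{"output_type": "stream", "text": "..."}]}]}
print(clear_outputs(nb)["cells"][0]["outputs"])  # []
```

Nothing about the executable code changes, which is why it's surprising that stripping outputs affected whether nbmake errored.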

adamjstewart commented 1 year ago

Things were working great, but now all of a sudden I'm seeing this error again: https://github.com/microsoft/torchgeo/actions/runs/4456067450/jobs/7826269250?pr=1124

I tried nbstripout on all notebooks, but that didn't help. I'm not sure what changed, but the error message isn't helpful. Is there any way to enable more verbose logging or other debugging tips you have?

alex-treebeard commented 1 year ago

Hi @adamjstewart, my gut feeling is that nbstripout is at best correlated with these test failures.

It feels like the root cause still needs to be discovered.

Common practice is to pragmatically retry flaky tests, accepting less than 100% reliability until we are able to identify a long-term fix.

Please can you try the pytest-rerunfailures plugin, which I believe will let us get a sense of how much retrying is necessary for a test to pass (if the tests are indeed just flaky due to resource contention).
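A minimal config sketch for that retry approach, assuming pytest-rerunfailures is installed (the `--reruns`/`--reruns-delay` flags come from that plugin):

```ini
[pytest]
# retry each failing notebook up to 3 times, waiting 5s between attempts
addopts = --nbmake --reruns 3 --reruns-delay 5
```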

Edit: Apologies, right now I can't give you better logs, although improving them is my top priority for this project.

adamjstewart commented 1 year ago

Thanks, let me try that. I agree that the issue is transient, which suggests that it's still a hardware issue, but I can't figure out what could possibly be wrong...

adamjstewart commented 1 year ago

Now it's magically working lol. I'm fine with using pytest-rerunfailures until we can get to the bottom of this issue.

alex-treebeard commented 1 year ago

What is your dev environment, by the way? Are you able to share which cloud instance/provider you are using (or whether you are using a local device)?

adamjstewart commented 1 year ago

This is using GitHub Actions; see https://github.com/microsoft/torchgeo/blob/main/.github/workflows/tutorials.yaml for the configuration.

alex-treebeard commented 1 year ago

Sorry, for context: I'm wondering how this all runs in your interactive dev environment, to get a sense of how the CI agent differs (which is the root of the challenge here, IMO).


adamjstewart commented 1 year ago

Ah, yes, when I run it locally on my laptop (macOS 10.15.7 x86_64 or macOS 13.2.1 arm64) everything works fine. All dependencies are installed via the Spack package manager.

alex-treebeard commented 1 year ago

I see. Likely you have 2-8x the hardware resources of the GitHub Actions agent.

It would be interesting to try tunneling into the agent and debugging interactively with the CLI. These docs are quite helpful: https://code.visualstudio.com/docs/remote/tunnels

Here's a workflow file that lets you open in vscode:

(Do not do this in a public repo, for security reasons.)

name: code

on:
  push:
jobs:
  code:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - run: |
        set -x
        set -e -u -o pipefail
        curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
        tar -xf vscode_cli.tar.gz
        ./code tunnel --accept-server-license-terms --no-sleep
adamjstewart commented 1 year ago

Oh for sure, my laptop is much more powerful. But I think our notebooks are using a very small amount of resources now, although it's hard to be sure. Hopefully a better error message will reveal what's really going on. If there was an OOM or disk-full error, that would make things obvious.

adamjstewart commented 12 months ago

We got things working for a while by rerunning the tests up to 10x until they passed, but that no longer seems to be working. See https://github.com/microsoft/torchgeo/pull/1521 for the newest error messages. As far as I know, nothing changed about the available GitHub runners. And the failing tutorial hasn't changed in years. Did anything change in nbmake? Any debugging suggestions?

adamjstewart commented 11 months ago

I think the issue was with one of our tutorials. Once we removed that tutorial in https://github.com/microsoft/torchgeo/pull/1521 everything passes without needing any reruns! I'll close this for now because I never liked that tutorial anyway, but I'll reopen if something similar happens again in the future.