Closed: adamjstewart closed this issue 1 year ago
I believe the notebook failure is correct, but the error message isn't particularly informative. Replacing `%nvidia-smi` with `!nvidia-smi` fixes the notebook on Google Colab, although it may still break in CI where no GPU is installed. We should still fix the error message if we can.
Hi Adam,
I can't reproduce this. Is it possible this was a transient error relating to some of the IPython network components?
I have, however, discovered a bug: it seems that `UsageError`s raised by unknown magic commands such as `%nvidia-smi` halt execution, yet are reported by nbmake as successful.
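A hypothetical reproduction of that bug might look like this (a sketch; the filename is made up):

```python
# Write a one-cell notebook whose only content is an unknown line magic, then
# run it through nbmake. If the bug is present, pytest reports the notebook as
# passing even though IPython raises a UsageError and execution stops.
import nbformat
import pytest

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("%nvidia-smi"))  # no such line magic
nbformat.write(nb, "unknown_magic.ipynb")

pytest.main(["--nbmake", "unknown_magic.ipynb"])
```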
Unfortunately this happens every time for me consistently. I'll try to reproduce it locally.
Couldn't reproduce locally. Could be something specific to the CI environment, or could be a difference in dependency versions.
Strange. Do you know how `%nvidia-smi` is installed? I tried that magic command in a Colab notebook and found it didn't exist. Are there any downsides to your workaround of using `!nvidia-smi`?
There is no `%nvidia-smi` magic command; you have to use `!nvidia-smi` instead. The bug isn't that `%nvidia-smi` doesn't exist, the bug is that nbmake fails with a mysterious error that doesn't help me locate the problem.
FWIW, still seeing this every time we try to test our notebooks: https://github.com/microsoft/torchgeo/actions/runs/4127657580/jobs/7131131990
Thanks for persistently reporting this, @adamjstewart.
I hit a brick wall last time because I couldn't repro. Are you able to provide a Docker image where we can repro this bug?
I'm also still trying to repro locally. So far it only happens for us in CI. I did notice that some of our cells time out when run locally, so I increased the default timeout period, but I'm still seeing the issue.
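For reference, raising the per-cell timeout looks roughly like this. The `--nbmake-timeout` flag is taken from nbmake's documentation, and the `docs/tutorials` path is an assumption about the repository layout:

```python
# Equivalent to running `pytest --nbmake --nbmake-timeout=600 docs/tutorials`
# from the shell: execute every notebook under docs/tutorials with a 600 s
# per-cell timeout.
import pytest

pytest.main(["--nbmake", "--nbmake-timeout=600", "docs/tutorials"])
```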
> Is it possible this was a transient error relating to some of the IPython network components?
> Unfortunately this happens every time for me consistently.
I will clarify this and say that the error happens ~100% of the time, but the notebook that errors out keeps changing. In https://github.com/microsoft/torchgeo/pull/1097, the failing notebook switched from `pretrained_weights.ipynb` to `trainers.ipynb`. It could be that reproducing the issue requires running all notebooks, not just one notebook.
Also note that the error is slightly different than last time. Now we have:
NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/__init__.py", line 166, in read
buf = fp.read()
AttributeError: 'str' object has no attribute 'read'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/notebooknode.py", line 47, in from_dict
return NotebookNode({k: from_dict(v) for k, v in d.items()})
ResourceWarning: unclosed <socket.socket fd=23, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 339, in from_call
result: Optional[TResult] = func()
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 260, in <lambda>
lambda: ihook(item=item, **kwds), when=when, reraise=reraise
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
gen.send(outcome)
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
yield from unraisable_exception_runtest_hook()
File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
Unfortunately I don't know how to get a full stack trace, so I can't tell what this str object is or why it isn't a file handle.
This is interesting. The error points to an `AttributeError` triggered here. However, this should be caught by the following line, and I'm not sure why it isn't being caught.
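For context, the pattern the traceback points at is roughly the duck-typing sketch below. This is a paraphrase, not the actual nbformat source: a plain str path is expected to raise AttributeError on `fp.read()`, which is then handled by falling back to `open()`. The puzzle here is why that normally-handled AttributeError surfaces in nbmake's internal error report.

```python
# Paraphrase of the "file-like first, path second" fallback seen in the traceback.
import io


def read_notebook_text(fp):
    try:
        buf = fp.read()  # works when fp is a file-like object
    except AttributeError:
        # fp is presumably a path string, so open it instead
        with open(fp, encoding="utf-8") as f:
            buf = f.read()
    return buf


print(read_notebook_text(io.StringIO('{"cells": []}')))
```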
@adamjstewart, would you be interested in containerising this test process?
I reckon either (a) this issue would stop happening, or (b) we would be able to repro in VS Code and root-cause it.
We could make your PR (https://github.com/microsoft/torchgeo/pull/1097) use the same pattern as this repo's tests.
I tried copying that pattern, but I must have done something wrong because the devcontainer isn't launching properly. I stole the devcontainer recipe from https://github.com/microsoft/torchgeo/pull/1085, but I've never used VS Code before, so I have no idea if it's correct.
I've done some investigation in a codespaces instance that runs the same container that I've put in the CI branch.
I believe we have two issues:
If I fix the latter issue, the error messages should get easier to parse, although having seen the logs when running in VS Code (Codespaces), it still isn't clear why the kernel is dying.
How would I go about debugging this? Is there an easy way to tell which step of the notebook is failing? I'm not sure whether print statements plus `pytest -s` work with notebooks or not.
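One possible way to narrow down the failing cell (not suggested in the thread) is to execute the notebook directly with nbclient, the execution library nbmake builds on, outside of pytest. The notebook path below is an assumption:

```python
# Execute a notebook cell by cell so the failing cell is reported explicitly.
import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError

nb = nbformat.read("docs/tutorials/trainers.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
try:
    client.execute()
except CellExecutionError as exc:
    # The exception message includes the failing cell's source and traceback.
    print(exc)
```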
Discovered something interesting. In https://github.com/microsoft/torchgeo/pull/1124, we finally got our notebook tests passing for the first time in a year by using smaller downloads. However, if you compare https://github.com/microsoft/torchgeo/pull/1124/commits/aadd19992a9e0d258f3f47a0f24ecdf7b6a4b389 and https://github.com/microsoft/torchgeo/pull/1124/commits/77b4e0b1ac7bdc062cb25e3f1feb9a5ffeb6e61e, you'll see that the tests were failing and I fixed them simply by clearing the output of the notebook. Is there any reason why the output would cause an internal error? That was quite a surprise to me.
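For reference, clearing stored outputs programmatically (roughly what nbstripout automates) can be done with nbformat; the path below is an assumption about the repository layout:

```python
# Strip stored outputs and execution counts from a notebook before committing it.
import nbformat

path = "docs/tutorials/pretrained_weights.ipynb"
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, path)
```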
Things were working great, but now all of a sudden I'm seeing this error again: https://github.com/microsoft/torchgeo/actions/runs/4456067450/jobs/7826269250?pr=1124
I tried nbstripout on all notebooks, but that didn't help. I'm not sure what changed, and the error message isn't helpful. Is there any way to enable more verbose logging, or do you have any other debugging tips?
Hi @adamjstewart, my gut feeling is that nbstripout is at best correlated with these test failures.
It feels like the root cause still needs to be discovered.
Common practice is to pragmatically retry flaky tests so the suite can get close to 100% reliability until we are able to identify a long-term fix.
Could you please try using this plugin, which I believe will let us get a sense of how much retrying is necessary for a test to pass (if the tests are indeed just flaky due to resource contention)?
Apologies, right now I can't give you better logs, although it's my top priority for this project.
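A minimal sketch of wiring in retries, assuming the plugin in question is pytest-rerunfailures (it is named later in the thread); the retry count, delay, and path are illustrative:

```python
# --reruns and --reruns-delay are pytest-rerunfailures options: retry failing
# notebook tests up to 3 times with a 5 s pause between attempts.
import pytest

pytest.main(["--nbmake", "--reruns", "3", "--reruns-delay", "5", "docs/tutorials"])
```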
Thanks, let me try that. I agree that the issue is transient, which suggests that it's still a hardware issue, but I can't figure out what could possibly be wrong...
Now it's magically working lol. I'm fine with using pytest-rerunfailures until we can get to the bottom of this issue.
What is your dev environment, by the way? Are you able to share which cloud instance/provider etc. you are using (or whether you are using a local device)?
This is using GitHub Actions, see here for the configuration.
Sorry, for context: I'm wondering how this all runs in your interactive dev environment, to get a sense of how the CI agent differs (which is the root of the challenge here, IMO).
Ah, yes, when I run it locally on my laptop (macOS 10.15.7 x86_64 or macOS 13.2.1 arm64) everything works fine. All dependencies are installed via the Spack package manager.
I see. You likely have 2-8x the hardware resources of the GitHub Actions agent.
It would be interesting to try tunneling into the agent and debugging interactively with the CLI. These docs are quite helpful: https://code.visualstudio.com/docs/remote/tunnels
Here's a workflow file that lets you open the agent in VS Code (do not do this in a public repo, for security reasons):
name: code
on:
  push:
jobs:
  code:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: |
          set -x
          set -e -u -o pipefail
          curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
          tar -xf vscode_cli.tar.gz
          ./code tunnel --accept-server-license-terms --no-sleep
Oh for sure, my laptop is much more powerful. But I think our notebooks are using a very small amount of resources now, although it's hard to be sure. Hopefully a better error message will reveal what's really going on. If there was an OOM or hard drive full error that would make things obvious.
We got things working for a while by rerunning the tests up to 10x until they passed, but that no longer seems to be working. See https://github.com/microsoft/torchgeo/pull/1521 for the newest error messages. As far as I know, nothing changed about the available GitHub runners. And the failing tutorial hasn't changed in years. Did anything change in nbmake? Any debugging suggestions?
I think the issue was with one of our tutorials. Once we removed that tutorial in https://github.com/microsoft/torchgeo/pull/1521 everything passes without needing any reruns! I'll close this for now because I never liked that tutorial anyway, but I'll reopen if something similar happens again in the future.
Describe the bug
I'm seeing the following error message in CI:
To Reproduce
Expected behavior
I would expect nbmake to either pass all tests, or inform me of what tests may have failed.
Desktop
Additional context
Seems similar to #71 but the error message is slightly different.