Open satheeshxolo opened 3 years ago
We don't officially support pytest so some features for it may be missing. Can you elaborate on the use case for --forked? Isn't pytest already able to run tests in isolation?
Use case: My team is integrating PyTorch with our backend hardware, so we write kernels and hook into the dispatched PyTorch op APIs. We run the PyTorch test suite in pytest's "boxed"/forked mode to isolate failing test cases from passing ones. If we run the tests in series (without --forked), one failing test can cause "cascaded false failures" in later tests (our stack is still under development). Since I hit this issue when running with pytest --forked, I was thinking there might be a way to support it, at least behind an environment variable. For example, something like:
    if noncontiguous and numel > 1:
        if os.environ.get('PYTORCH_TEST_WITH_PYTEST_FORKED', 'OFF') == 'ON':
            pass  # single-threaded noncontiguous handling that avoids at::parallel_for
        else:
            result = torch.repeat_interleave(result, 2, dim=-1)  # current implementation
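With something like that in place, one could opt in per run, e.g. PYTORCH_TEST_WITH_PYTEST_FORKED=ON python -m pytest test_ops.py --forked (the variable name is only the proposal made here, not an existing PyTorch option).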
It sounds like the issue is the use of multiple threads in this mode, however, which suggests that just replacing one repeat_interleave call is unlikely to have the desired effect. What about disabling parallelism and using only a single thread?
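For example, one way to keep ATen on a single thread for the whole run would be a small conftest.py next to the tests (just a sketch assuming an OpenMP/MKL build; this is not something the test suite ships):

    # conftest.py (hypothetical) -- force single-threaded execution before torch initializes
    import os

    # These must be set before the OpenMP/MKL runtimes are initialized.
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    os.environ.setdefault("MKL_NUM_THREADS", "1")

    import torch

    # Run at::parallel_for work on the calling thread only.
    torch.set_num_threads(1)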
If I disable multiple workers in pytest's forked mode (-n1, i.e. a single worker), there is no hang. But it would be nice if make_tensor() also had a single-threaded implementation that doesn't end up calling at::parallel_for().
@satheeshxolo But is make_tensor() being multithreaded the only reason the test suite doesn't work while doing this? That seems very unlikely.
@mruberry - In my observations, the noncontiguous handling based on torch.repeat_interleave() (which in turn uses at::parallel_for()) is the point where execution deadlocks when running with pytest's --forked.
@satheeshxolo Yes but if you fix that what breaks next?
@mruberry - I am not familiar with an alternative that uses an op other than repeat_interleave for noncontiguous tensors. If there is one, please point me to it so I can check whether it solves the issue.
You can probably just change make_tensor to ignore the noncontiguous kwarg while debugging
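A rough way to do that while debugging, without editing PyTorch itself, would be to wrap make_tensor from a conftest.py (a sketch only; the module path assumes the 1.9.0 layout mentioned below, where make_tensor lives in torch.testing._internal.common_utils):

    # conftest.py (hypothetical) -- drop the noncontiguous kwarg for debugging only
    import torch.testing._internal.common_utils as common_utils

    _orig_make_tensor = common_utils.make_tensor

    def _contiguous_only_make_tensor(*args, **kwargs):
        kwargs.pop("noncontiguous", None)  # skip the repeat_interleave-based path
        return _orig_make_tensor(*args, **kwargs)

    common_utils.make_tensor = _contiguous_only_make_tensor

Whether this takes effect depends on import order: modules that have already done "from ... import make_tensor" keep their own reference, so the patch may need to run before those modules are imported.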
🐛 Bug
While trying to run the tests under pytorch/test/ with pytest's boxing option "--forked" (to isolate a few crashing tests), I see that many tests hang.
To Reproduce
Steps to reproduce the behavior:
python -u -m pytest test_ops.py --forked -svk test_out_cos_cpu_float32
Expected behavior
I expect there to be some supported way to run the tests in pytest's boxed mode. Tests shouldn't hang with pytest's --forked option.
Environment
Collecting environment information...
PyTorch version: 1.9.0a0+git09dfd6d
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.17

Python version: 3.7 (64-bit runtime)
Python platform: Linux-4.15.0-145-generic-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] torch==1.0.0+767d490.dirty
[pip3] torch-dataloader==1.0.0+767d490.dirty
[pip3] numpy==1.21.2
[pip3] torch==1.9.0a0+git09dfd6d
[conda] torch 1.0.0+767d490.dirty pypi_0 pypi
[conda] torch-dataloader 1.0.0+767d490.dirty pypi_0 pypi
[conda] mkl 2019.0 118
[conda] mkl-include 2019.0 118
[conda] numpy 1.21.2 pypi_0 pypi
[conda] numpy-base 1.20.3 py37h39b7dee_0
[conda] torch 1.9.0a0+git09dfd6d pypi_0 pypi
Additional context
I did a first-level triage of the issue. The problem seems to originate in the handling of "noncontiguous" tensors in the make_tensor() utility (torch/testing/_creation.py on latest, but torch/testing/_internal/common_utils.py in 1.9.0). For "noncontiguous" tensors there is an additional conditional call to torch.repeat_interleave(result, 2, dim=-1). The kernel implementation of repeat_interleave in aten/src/ATen/native/Repeat.cpp uses at::parallel_for(), which can execute on multiple threads. My guess is that this use of repeat_interleave for "noncontiguous" causes the hang under pytest's "--forked" option.
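If that guess is right, the interaction should be reproducible without pytest at all: once a parallel_for-backed op has started ATen's intra-op thread pool in the parent process, a child created by fork() inherits the memory but not the worker threads, and the next parallel op in the child can block. A minimal sketch of that hypothesis (whether it actually hangs will depend on the threading backend and build):

    # hypothetical standalone reproducer for the suspected fork + at::parallel_for deadlock
    import os
    import torch

    t = torch.randn(8, 8)
    _ = torch.repeat_interleave(t, 2, dim=-1)  # warms up the intra-op thread pool in the parent

    pid = os.fork()
    if pid == 0:
        # Child process: only the forking thread survives, so a parallel_for-backed
        # op may wait forever on worker threads that no longer exist.
        _ = torch.repeat_interleave(t, 2, dim=-1)
        os._exit(0)
    os.waitpid(pid, 0)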
cc @mruberry