r2plus1d_18 segfault on linux gpu CI jobs

pytorch / vision



r2plus1d_18 segfault on linux gpu CI jobs #3702

Closed NicolasHug closed 3 years ago

NicolasHug commented 3 years ago

test_models.py segfaults on r2plus1d_18_cuda on the unittest_linux_gpu_py3.8 CI job. See e.g. failures in https://github.com/pytorch/vision/pull/3700 or https://github.com/pytorch/vision/pull/3699

cc @seemethere

fmassa commented 3 years ago

Interesting. This is worth investigating more closely, as it might be an issue with some of the PyTorch kernels used to run this model on GPU (or the machine might be running out of memory).

NicolasHug commented 3 years ago

Another report of the same issue: https://github.com/pytorch/vision/issues/3765

vfdev-5 commented 3 years ago

@NicolasHug I tried that locally and could not reproduce it in 2-3 runs. I also think it may be hiding other failures, since a few other tests are broken locally on my infra. Maybe it is also worth disabling it before suggesting a fix.
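
Since 2-3 runs may be too few to catch an intermittent crash, one way to try harder is to rerun the test in a subprocess many times and count the runs killed by SIGSEGV. The following is a minimal sketch, assuming a POSIX system; the helper name `count_segfaults` is hypothetical and not part of torchvision.

```python
import signal
import subprocess


def count_segfaults(cmd, runs=3):
    """Run `cmd` `runs` times and count the runs killed by SIGSEGV.

    Hypothetical helper for illustration. Relies on POSIX behaviour:
    a negative return code -N means the child was killed by signal N.
    """
    segfaults = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults
```

It could be called with something like `count_segfaults([sys.executable, "-m", "pytest", "test/test_models.py", "-k", "r2plus1d_18"], runs=20)`, where the test path and the `-k` filter are assumptions about the local checkout. Running each attempt in its own process means a crash does not take down the loop itself.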

fmassa commented 3 years ago

I think the segfaults might be due to memory issues, but to be safe let's disable it for now
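
Disabling only the affected combination could look roughly like the sketch below. It uses the standard-library `unittest` skip mechanism; the `DISABLED` set, `is_disabled`, and the `ModelSmokeTest` class are hypothetical names for illustration, not the actual contents of test_models.py.

```python
import unittest

# Hypothetical deny-list of (model name, device) pairs to skip for now.
DISABLED = {("r2plus1d_18", "cuda")}


def is_disabled(model_name, device):
    """Return True if this model/device pair is temporarily disabled."""
    return (model_name, device) in DISABLED


class ModelSmokeTest(unittest.TestCase):
    def check_model(self, model_name, device):
        if is_disabled(model_name, device):
            # Reported as skipped instead of crashing the whole test run.
            self.skipTest(f"{model_name} on {device} segfaults on CI, see #3702")
        # Placeholder for the real forward-pass checks.
        self.assertTrue(model_name)

    def test_r2plus1d_18_cuda(self):
        self.check_model("r2plus1d_18", "cuda")

    def test_r2plus1d_18_cpu(self):
        self.check_model("r2plus1d_18", "cpu")
```

Keeping the skip keyed on both model and device leaves the CPU variant running, so unrelated regressions in the model are still caught while the GPU crash is investigated.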