Closed ashwindcruz closed 2 years ago
Having similar issues on Ubuntu 20.04.
nvidia-smi:
Driver Version: 510.39.01 CUDA Version: 11.6
nvcc --version:
Build cuda_11.3.r11.3/compiler.29920130_0
(Note: nvidia-smi CUDA version is the max the driver will accept, not installed.)
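To make the note above concrete, here is a minimal sketch of the compatibility check it implies: the toolkit reported by nvcc should not be newer than the maximum CUDA version reported by nvidia-smi. The function names are made up for illustration, and the version strings are copied from the output above.

```python
def parse_version(text):
    """Turn a dotted version such as '11.6' into a comparable tuple (11, 6)."""
    return tuple(int(part) for part in text.strip().split("."))


def toolkit_supported(driver_max, toolkit):
    """Return True if the installed toolkit is not newer than the driver's maximum."""
    return parse_version(toolkit) <= parse_version(driver_max)


# "11.6" comes from nvidia-smi, "11.3" from nvcc --version.
print(toolkit_supported("11.6", "11.3"))  # prints True
```

So on this machine the driver and toolkit versions are at least not in conflict with each other.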
nestedtensor did eventually build for me, however.
Also missing pytorch-lightning if I don't install the maua/audio/requirements.txt.
Regardless, getting Segmentation fault (core dumped) trying to run maua.
So, I started some digging:
$ gdb --args python -m maua
(gdb) run
Starting program: /.../maua/envs/bin/python -m maua
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 32595]
[New Thread 0x7fff0c110700 (LWP 32597)]
...etc...
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffea6a901e1 in google::protobuf::internal::ReflectionOps::FindInitializationErrors(google::protobuf::Message const&, std::string const&, std::vector<std::string, std::allocator<std::string> >*) ()
from /.../maua/envs/lib/python3.8/site-packages/google/protobuf/pyext/_message.cpython-38-x86_64-linux-gnu.so
(gdb)
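The gdb backtrace only shows the native frame inside protobuf's compiled extension. As a complementary sketch, Python's standard faulthandler module can dump the Python-level traceback when the process receives SIGSEGV, which may show which import triggers the crash. The helper name and the log file name below are assumptions for illustration; running `python -X faulthandler -m maua` enables the same handler without code changes.

```python
import faulthandler
import os
import tempfile


def enable_crash_log(path):
    """Write Python tracebacks of all threads to `path` on a fatal signal."""
    # The file object must stay open for as long as the handler is enabled.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path works.
crash_log = enable_crash_log(os.path.join(tempfile.gettempdir(), "maua_crash.log"))
```

This would need to run before the import that crashes, for example at the very top of the entry point.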
That doesn't make a lot of sense to me, but I thought I'd try Python 3.9, and that did seem to help. I got a few more odd errors, but was able to fix them with a couple of hacks:
$ python -m maua --help
Traceback (most recent call last):
...
File "/.../maua/maua/style/image.py", line 17, in <module>
from maua.optimizers import load_optimizer, OPTIMIZERS
File "/.../maua/maua/optimizers.py", line 4, in <module>
import torch_optimizer as more_optim
ModuleNotFoundError: No module named 'torch_optimizer'
$ pip install torch-optimizer
$ python -m maua --help
Traceback (most recent call last):
...
File "/.../maua/maua/style/image.py", line 17, in <module>
from maua.optimizers import load_optimizer, OPTIMIZERS
File "/.../maua/maua/optimizers.py", line 30, in <module>
"NovoGrad": timm_optim.NovoGrad,
AttributeError: module 'timm.optim' has no attribute 'NovoGrad'
And I could fix the last by commenting out NovoGrad in the optimizer list.
Not sure why Python 3.9 helps or why those other errors are happening, but maybe this helps identify the real problem?
Also note that I haven't done the pip install cupy-cuda113==9.6 step of the install yet, as I was trying to limit the potential causes.
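Commenting out NovoGrad works, but the same AttributeError would return for any other optimizer a given timm release does not provide. A minimal sketch of a more tolerant registry follows. It is not the code in maua/optimizers.py: `fake_timm_optim` is a stand-in for `timm.optim`, and the names other than NovoGrad are placeholders chosen for illustration.

```python
from types import SimpleNamespace

# Stand-in for timm.optim so the sketch runs without timm installed.
# Which classes a real timm release exposes is an assumption here.
fake_timm_optim = SimpleNamespace(Lamb=object, RAdam=object)


def collect_optimizers(module, names):
    """Map each name to module.<name>, skipping names the module lacks."""
    found = {}
    for name in names:
        optimizer = getattr(module, name, None)
        if optimizer is not None:
            found[name] = optimizer
    return found


OPTIMIZERS = collect_optimizers(fake_timm_optim, ["Lamb", "NovoGrad", "RAdam"])
print(sorted(OPTIMIZERS))  # prints ['Lamb', 'RAdam']
```

With this shape, a missing optimizer simply does not appear in the list instead of breaking every command at import time.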
Hello @ashwindcruz and @RKelln thanks for raising the issue!
I think this is mainly a case of the README being a little out of sync with the repo structure. I've added the missing dependencies to the requirements.txt and updated the commands/paths in the README.
I believe the segfault is related to the cupy version that gets found by conda not being compatible with the cudatoolkit version (at least I remember getting segfaults and ended up adding the extra cupy-cuda113 install). I haven't been able to reproduce the segfaults on my machine now though (either when uninstalling cupy-cuda113 or reinstalling from scratch without it).
Could you try reinstalling the repo with the updated commands in the README? Or alternatively just continue with Python 3.9 instead (although I believe this gave some dependency issues in the audio package).
@ashwindcruz Are you getting an error when installing nestedtensor? Building the wheel can take a long time (~5 min on my machine, but maybe longer with fewer CPU cores).
I don't think it's needed for upscaling, but as of now the CLI imports the full tree of files on every execution (that's also why each command is so slow to start at the moment). I need to restructure things so that running a given command only imports the parts it actually needs, but I haven't thought of a good way to do that yet...
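One possible shape for that restructuring, sketched under assumptions: keep a table that maps each subcommand to the module implementing it, and import that module only once the subcommand is known. The table below is hypothetical. Standard-library modules stand in for maua subpackages so the sketch runs anywhere, and "super" and "style" are used as command names only because they appear in this thread.

```python
import importlib

# Hypothetical mapping from subcommand to implementing module.
# "json" and "colorsys" are placeholders for real subpackages.
COMMANDS = {
    "super": "json",
    "style": "colorsys",
}


def load_command(name):
    """Import and return only the module behind the requested subcommand."""
    try:
        module_path = COMMANDS[name]
    except KeyError:
        raise SystemExit(f"unknown command: {name}")
    return importlib.import_module(module_path)


module = load_command("super")
print(module.__name__)  # prints json
```

With this layout, a missing or crashing dependency in one subpackage would only affect the commands that actually need it.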
Was able to try an install using the new instructions and requirements. However, I ran into an issue with torchvision:
/.../lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
warn(f"Failed to load image Python extension: {e}")
Segmentation fault (core dumped)
I noticed that it had installed pytorch 1.10.2 from conda, but then uninstalled that and installed 1.10.1 using pip. I tried installing 1.10.2 using pip, but that didn't help, so I then tried a reinstall without the requirements.txt file version lock. With that I get just Segmentation fault (core dumped), as I used to with the previous install instructions. So I tried locking the conda environment to the 1.10.1 versions... still segfaults. My Python 3.9 conda environment still works fine. The only difference seems to be the Python version?
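Since conda and pip each replaced the other's torch build here, it may be worth checking that the installed torch, torchvision and torchaudio still belong together. The sketch below uses only the standard library. The pairing table is limited to the version combinations that appear in the collect_env reports in this thread, and the helper names are made up for illustration.

```python
from importlib import metadata

# torch -> (torchvision, torchaudio), as seen in this thread's reports.
KNOWN_PAIRS = {
    "1.10.1": ("0.11.2", "0.10.1"),
    "1.10.2": ("0.11.3", "0.10.2"),
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def base(version):
    """Strip a local build suffix such as '+cu113'."""
    return version.split("+")[0]


def check_pairing(torch_version, vision_version, audio_version):
    """True/False if the torch version is in the table, None if unknown."""
    if None in (torch_version, vision_version, audio_version):
        return None
    expected = KNOWN_PAIRS.get(base(torch_version))
    if expected is None:
        return None
    return (base(vision_version), base(audio_version)) == expected


versions = [installed_version(p) for p in ("torch", "torchvision", "torchaudio")]
print(versions, check_pairing(*versions))
```

A result of False would point at a mixed install, for example torch 1.10.2 from one source next to torchvision 0.11.2 from another.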
Using pytorch collect_env:
Working python 3.9 env:
Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
[conda] cudatoolkit-dev 11.3.1 py39h3811e60_0 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h8d4b97c_729 conda-forge
[conda] mkl-service 2.4.0 py39h7e14d7c_0 conda-forge
[conda] mkl_fft 1.3.1 py39h0c7bc48_1 conda-forge
[conda] mkl_random 1.2.2 py39hde0f152_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch 1.10.1 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.5.9 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.10.1 py39_cu113 pytorch
[conda] torchcrepe 0.0.15 pypi_0 pypi
[conda] torchmetrics 0.7.0 pypi_0 pypi
[conda] torchvision 0.11.2 py39_cu113 pytorch
Broken 3.8 reinstall using 1.10.2:
Collecting environment information...
PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.2
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.2
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.3
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
[conda] cudatoolkit-dev 11.3.1 py38h497a2fe_0 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h8d4b97c_729 conda-forge
[conda] mkl-service 2.4.0 py38h95df7f1_0 conda-forge
[conda] mkl_fft 1.3.1 py38h8666266_1 conda-forge
[conda] mkl_random 1.2.2 py38h1abd341_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch 1.10.2 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.5.9 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.10.2 py38_cu113 pytorch
[conda] torchcrepe 0.0.15 pypi_0 pypi
[conda] torchmetrics 0.7.0 pypi_0 pypi
[conda] torchvision 0.11.3 py38_cu113 pytorch
Broken 1.10.1 reinstall:
Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
[conda] cudatoolkit-dev 11.3.1 py38h497a2fe_0 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h8d4b97c_729 conda-forge
[conda] mkl-service 2.4.0 py38h95df7f1_0 conda-forge
[conda] mkl_fft 1.3.1 py38h8666266_1 conda-forge
[conda] mkl_random 1.2.2 py38h1abd341_0 conda-forge
[conda] mypy-extensions 0.4.3 pypi_0 pypi
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-lightning 1.5.9 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.10.1 py38_cu113 pytorch
[conda] torchcrepe 0.0.15 pypi_0 pypi
[conda] torchmetrics 0.7.0 pypi_0 pypi
[conda] torchvision 0.11.2 py38_cu113 pytorch
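The three reports above are long enough that small differences are easy to miss. A minimal sketch for comparing two of them field by field follows; the helper names are made up, and the sample lines are copied from the working 3.9 report and the broken 1.10.1 report (only a few fields are included to keep it short).

```python
def parse_report(text):
    """Collect 'key: value' lines of a collect_env report into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_reports(a, b):
    """Return {key: (value_in_a, value_in_b)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


WORKING = """PyTorch version: 1.10.1
Libc version: glibc-2.31
Python version: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.31"""

BROKEN = """PyTorch version: 1.10.1
Libc version: glibc-2.31
Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.10"""

for key, (working, broken) in diff_reports(parse_report(WORKING), parse_report(BROKEN)).items():
    print(f"{key}:\n  working: {working}\n  broken:  {broken}")
```

On these excerpts the differing fields are the Python version and the Python platform string, which reports glibc2.31 in the working environment and glibc2.10 in the broken one. Whether that platform string is related to the segfault is not established here.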
Do either of you run into similar issues if you re-install using the current instructions in a clean environment?
I've streamlined quite a bit of the installation to essentially run with just pip. Maybe that helps avoid these segfaults?
Alright going to close this for now as I think it's stale. Feel free to open up a new issue (or re-open this one) if you still run into problems!
I'm trying to upscale some images and am trying to use this repo to do so. I have run into a couple of issues.
During the installation steps:
When running pip install -r requirements.txt, the process gets stuck when attempting to build the wheel for nestedtensor. I commented this out, hoping that this package would not be required for upscaling.
When I tried to run the provided upscaling command,
python -m maua super /path_to_my_image.png --model_name RealESRGAN-pbaylies-hr-paintings
I got an error message because pytorch-lightning wasn't installed. I resolved this by installing the package from here: https://www.pytorchlightning.ai/
I then reran the upscaling command but got the following error:
Segmentation fault (core dumped)
I am using Ubuntu 18.04. I have a Tesla T4 GPU. My CUDA version is 11.4 and my NVIDIA driver version is 470.82.00. When this error occurred, my CPU was about 98% idle, I had about 10 GB of RAM free, and my T4 was completely free.
Could you please advise?