Attempting to run upscaling on images

ashwindcruz commented 2 years ago

I'm trying to upscale some images and am trying to use this repo to do so. I have run into a couple of issues.

During the installation steps:

git clone --recursive https://github.com/maua-maua-maua/maua.git 
cd maua
conda create -n maua python=3.8 pytorch torchvision torchaudio cudatoolkit=11.3 cudatoolkit-dev=11.3 cudnn mpi4py Cython pip=21.3.1 -c nvidia -c pytorch -c conda-forge
conda activate maua
pip install -r requirements.txt
pip install -r audio/requirements.txt
pip install cupy-cuda113==9.6

When running pip install -r requirements.txt, the process gets stuck while building the wheel for nestedtensor. I commented that requirement out, hoping the package is not needed for upscaling.

When I tried to run the provided upscaling command, python -m maua super /path_to_my_image.png --model_name RealESRGAN-pbaylies-hr-paintings, I got an error message because pytorch-lightning wasn't installed. I resolved this by installing the package from here: https://www.pytorchlightning.ai/
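A quick way to find missing imports like this one before launching the CLI is to probe for them up front. The sketch below is not part of maua; the import names in `wanted` are assumptions based on the packages mentioned in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Import names are assumptions drawn from packages named in this thread.
    wanted = ["torch", "torchvision", "pytorch_lightning", "nestedtensor"]
    for name in missing_modules(wanted):
        print(f"missing: {name}")
```

Only top-level names are checked here; `find_spec` does not import the module, so a package that is present but broken would still pass.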

I then reran the upscaling command but got the following error: Segmentation fault (core dumped)

I am using Ubuntu 18.04 with a Tesla T4 GPU. My CUDA version is 11.4 and my NVIDIA driver version is 470.82.00. When this error occurred, my CPU was about 98% idle, I had about 10 GB of RAM free, and my T4 was completely free.

Could you please advise?

RKelln commented 2 years ago

Having similar issues on Ubuntu 20.04.

nvidia-smi: 
Driver Version: 510.39.01    CUDA Version: 11.6
nvcc --version:
Build cuda_11.3.r11.3/compiler.29920130_0

(Note: the CUDA version reported by nvidia-smi is the highest version the driver supports, not the version that is installed.)
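To make that distinction concrete, here is a rough sketch that parses the two lines quoted above and checks that the installed toolkit does not exceed what the driver supports. The function names and regexes are my own, and real CUDA compatibility rules are more nuanced than a simple version comparison.

```python
import re


def parse_version(text, pattern):
    """Extract a (major, minor) tuple using a regex with two capture groups."""
    m = re.search(pattern, text)
    if m is None:
        raise ValueError(f"no version found in: {text!r}")
    return int(m.group(1)), int(m.group(2))


def driver_supports_toolkit(smi_line, nvcc_line):
    """True if the driver's maximum CUDA version covers the installed toolkit."""
    driver_max = parse_version(smi_line, r"CUDA Version:\s*(\d+)\.(\d+)")
    toolkit = parse_version(nvcc_line, r"cuda_(\d+)\.(\d+)")
    return toolkit <= driver_max


smi = "Driver Version: 510.39.01    CUDA Version: 11.6"
nvcc = "Build cuda_11.3.r11.3/compiler.29920130_0"
print(driver_supports_toolkit(smi, nvcc))
```

For the values above, the toolkit (11.3) is within what the driver supports (11.6), so a driver/toolkit mismatch looks unlikely to be the cause here.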

nestedtensor did eventually build for me however.

Also missing pytorch-lightning if I don't install the maua/audio/requirements.txt.

Regardless, getting Segmentation fault (core dumped) trying to run maua.

So I started digging:

$ gdb --args python -m maua

(gdb) run
Starting program: /.../maua/envs/bin/python -m maua
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 32595]
[New Thread 0x7fff0c110700 (LWP 32597)]
...etc...

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffea6a901e1 in google::protobuf::internal::ReflectionOps::FindInitializationErrors(google::protobuf::Message const&, std::string const&, std::vector<std::string, std::allocator<std::string> >*) ()
   from /.../maua/envs/lib/python3.8/site-packages/google/protobuf/pyext/_message.cpython-38-x86_64-linux-gnu.so
(gdb)
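As a complement to gdb, Python's standard `faulthandler` module can print the Python-level stack at the moment of the crash, which may show which import triggers the protobuf call. This is a minimal sketch; the helper name is hypothetical and nothing here is specific to maua.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump the Python stack of all threads to `stream` on SIGSEGV and similar."""
    faulthandler.enable(file=stream or sys.stderr, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # Hypothetical next step: import the package under investigation here,
    # e.g. `import maua`, and watch stderr for the traceback.
```

The same handler can be switched on without editing any code by running `python -X faulthandler -m maua`.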

That doesn't make a lot of sense to me, but I thought I'd try Python 3.9, and that did seem to help. I got a few more odd errors, but was able to fix them with a couple of hacks:

$ python -m maua --help
Traceback (most recent call last):
...
  File "/.../maua/maua/style/image.py", line 17, in <module>
    from maua.optimizers import load_optimizer, OPTIMIZERS
  File "/.../maua/maua/optimizers.py", line 4, in <module>
    import torch_optimizer as more_optim
ModuleNotFoundError: No module named 'torch_optimizer'

$ pip install torch-optimizer

$ python -m maua --help
Traceback (most recent call last):
...
  File "/.../maua/maua/style/image.py", line 17, in <module>
    from maua.optimizers import load_optimizer, OPTIMIZERS
  File "/.../maua/maua/optimizers.py", line 30, in <module>
    "NovoGrad": timm_optim.NovoGrad,
AttributeError: module 'timm.optim' has no attribute 'NovoGrad'

And I could fix the last one by commenting out NovoGrad in the optimizer list.
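Instead of commenting entries out by hand, the optimizer table could skip names the installed library no longer provides. The sketch below shows the idea with a stand-in object in place of `timm.optim`; `collect_optimizers` and `PresentOptimizer` are hypothetical names, and this is not how maua/optimizers.py is actually written.

```python
from types import SimpleNamespace


def collect_optimizers(module, names):
    """Map each name to module.<name>, skipping names the module lacks."""
    found, skipped = {}, []
    for name in names:
        cls = getattr(module, name, None)
        if cls is None:
            skipped.append(name)
        else:
            found[name] = cls
    return found, skipped


# Stand-in for timm.optim so the sketch runs without timm installed.
# "PresentOptimizer" is a made-up attribute, not a real timm class.
fake_timm_optim = SimpleNamespace(PresentOptimizer=object)

optimizers, skipped = collect_optimizers(
    fake_timm_optim, ["PresentOptimizer", "NovoGrad"]
)
print(sorted(optimizers), skipped)
```

Reporting the skipped names (rather than silently dropping them) keeps it obvious when a dependency upgrade has removed something.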

I'm not sure why Python 3.9 helps, or why those other errors are happening, but maybe this helps identify the real problem?

Also note that I haven't done the pip install cupy-cuda113==9.6 step of the install yet as I was trying to limit the potential causes.

JCBrouwer commented 2 years ago

Hello @ashwindcruz and @RKelln, thanks for raising the issue!

I think this is mainly a case of the README being a little out of sync with the repo structure. I've added the missing dependencies to the requirements.txt and updated the commands/paths in the README.

I believe the segfault is related to the cupy version that gets found by conda not being compatible with the cudatoolkit version (at least I remember getting segfaults and ended up adding the extra cupy-cuda113 install). I haven't been able to reproduce the segfaults on my machine now though (either when uninstalling cupy-cuda113 or reinstalling from scratch without it).
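For reference, the pip wheel used in the install steps encodes the CUDA version in its name (`cupy-cuda113` for CUDA 11.3). A small sketch of that mapping, assuming the same `cupy-cudaXY` naming holds for the version you target (newer CuPy releases use different wheel names, so check before relying on it):

```python
def cupy_wheel_for(cuda_version):
    """Build a CuPy wheel name from a CUDA 'major.minor[.patch]' string."""
    major, minor = cuda_version.split(".")[:2]
    return f"cupy-cuda{int(major)}{int(minor)}"


# 11.3 is the cudatoolkit version pinned in the conda command in this thread.
print(cupy_wheel_for("11.3"))
```

Comparing this name against what `pip list` shows is a quick way to spot a CuPy build that targets a different CUDA version than the toolkit in the environment.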

Could you try reinstalling the repo with the updated commands in the README? Or alternatively just continue with Python 3.9 instead (although I believe this gave some dependency issues in the audio package).

@ashwindcruz Are you getting an error when installing nestedtensor? Building the wheel can take a long time (~5 min on my machine, but maybe longer with fewer CPU cores).

I don't think it's needed for upscaling, but as of now the CLI imports the full tree of files on every execution (that's also why each command is so slow to start at the moment). I need to restructure things so that running a given command only imports the parts it actually needs, but I haven't thought of a good way to do that yet...
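One common pattern for this is to keep a table of command names to module paths and import a module only when its command is dispatched. The sketch below uses stdlib modules as stand-ins so it runs anywhere; the command names and the `lazy-cli` program name are hypothetical, and this is only one possible approach, not the repo's design.

```python
import argparse
import importlib

# Command name -> module path. The stdlib modules are stand-ins; a real table
# would point at the package's own submodules (hypothetical example:
# "super" -> "maua.super").
COMMANDS = {
    "json": "json",
    "stats": "statistics",
}


def load_command(name):
    """Import and return only the module backing the requested command."""
    return importlib.import_module(COMMANDS[name])


def main(argv=None):
    parser = argparse.ArgumentParser(prog="lazy-cli")
    parser.add_argument("command", choices=sorted(COMMANDS))
    args = parser.parse_args(argv)
    module = load_command(args.command)  # no other command module is imported
    return module.__name__


print(main(["stats"]))
```

With this layout, a missing optional dependency such as nestedtensor would only break the commands that actually import it, and startup time would depend on the chosen command alone.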

RKelln commented 2 years ago

I was able to try an install using the new instructions and requirements. However, I ran into an issue with torchvision:

/.../lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
Segmentation fault (core dumped)

I noticed that it had installed pytorch 1.10.2 from conda, but then uninstalled that and installed 1.10.1 using pip. I tried installing 1.10.2 using pip, but that didn't help, so I then tried a reinstall without the version lock in requirements.txt. With that I get just Segmentation fault (core dumped), as I did with the previous install instructions. So I tried locking the conda environment to the 1.10.1 versions... still segfaults. My Python 3.9 conda environment still works fine. The only difference seems to be the Python version?

Using pytorch collect_env:

Working python 3.9 env:

Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] cudatoolkit-dev           11.3.1           py39h3811e60_0    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h8d4b97c_729    conda-forge
[conda] mkl-service               2.4.0            py39h7e14d7c_0    conda-forge
[conda] mkl_fft                   1.3.1            py39h0c7bc48_1    conda-forge
[conda] mkl_random                1.2.2            py39hde0f152_0    conda-forge
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch                   1.10.1          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.9                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.1               py39_cu113    pytorch
[conda] torchcrepe                0.0.15                   pypi_0    pypi
[conda] torchmetrics              0.7.0                    pypi_0    pypi
[conda] torchvision               0.11.2               py39_cu113    pytorch

Broken 3.8 reinstall using 1.10.2:

Collecting environment information...
PyTorch version: 1.10.2
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.2
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.2
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.3
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] cudatoolkit-dev           11.3.1           py38h497a2fe_0    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h8d4b97c_729    conda-forge
[conda] mkl-service               2.4.0            py38h95df7f1_0    conda-forge
[conda] mkl_fft                   1.3.1            py38h8666266_1    conda-forge
[conda] mkl_random                1.2.2            py38h1abd341_0    conda-forge
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch                   1.10.2          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.9                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.2               py38_cu113    pytorch
[conda] torchcrepe                0.0.15                   pypi_0    pypi
[conda] torchmetrics              0.7.0                    pypi_0    pypi
[conda] torchvision               0.11.3               py38_cu113    pytorch

Broken 1.10.1 reinstall:

Collecting environment information...
PyTorch version: 1.10.1
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Jan 30 2022, 23:53:36)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.13.0-28-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.3.109
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.20.1
[pip3] pytorch-lightning==1.5.9
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.10.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.10.1
[pip3] torchcrepe==0.0.15
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.11.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] cudatoolkit-dev           11.3.1           py38h497a2fe_0    conda-forge
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.4.0           h8d4b97c_729    conda-forge
[conda] mkl-service               2.4.0            py38h95df7f1_0    conda-forge
[conda] mkl_fft                   1.3.1            py38h8666266_1    conda-forge
[conda] mkl_random                1.2.2            py38h1abd341_0    conda-forge
[conda] mypy-extensions           0.4.3                    pypi_0    pypi
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch                   1.10.1          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-lightning         1.5.9                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.10.1               py38_cu113    pytorch
[conda] torchcrepe                0.0.15                   pypi_0    pypi
[conda] torchmetrics              0.7.0                    pypi_0    pypi
[conda] torchvision               0.11.2               py38_cu113    pytorch
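Comparing these dumps by eye is error-prone, so here is a small sketch that diffs the `[pip3]` lines of two collect_env reports. The helper names are my own, and the sample strings are abbreviated from the working 3.9 and broken 1.10.2 reports above.

```python
def parse_versions(report):
    """Parse '[pip3] name==version' lines from a collect_env report."""
    versions = {}
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("[pip3]") and "==" in line:
            name, version = line[len("[pip3]"):].strip().split("==", 1)
            versions[name] = version
    return versions


def diff_versions(a, b):
    """Return {name: (version_in_a, version_in_b)} for packages that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


working = """
[pip3] numpy==1.20.1
[pip3] torch==1.10.1
[pip3] torchvision==0.11.2
"""
broken = """
[pip3] numpy==1.20.1
[pip3] torch==1.10.2
[pip3] torchvision==0.11.3
"""
print(diff_versions(parse_versions(working), parse_versions(broken)))
```

Running the same diff on the working 3.9 report and the broken 1.10.1 report would show whether any pip package differs at all, or whether the Python version really is the only change.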

JCBrouwer commented 2 years ago

Do either of you run into similar issues if you re-install using the current instructions in a clean environment?

I've streamlined quite a bit of the installation to essentially run with just pip. Maybe that helps avoid these segfaults?

JCBrouwer commented 2 years ago

Alright, going to close this for now as I think it's stale. Feel free to open up a new issue (or re-open this one) if you still run into problems!
