noahstier / vortx

Source code for the paper "Volumetric 3D Reconstruction with Transformers for Voxel-wise View Selection and Fusion"
MIT License

Training does not detect GPUs #9

Open DevLeo1 opened 1 year ago

DevLeo1 commented 1 year ago

Hi, I've been stuck on this for hours, and every attempt is painful because each conda build and installation takes so long.

My main issue is that I followed all the steps to install all dependencies needed for this project, but when I try to start the training process

python scripts/train.py --config config.yml

I just get this:

home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_0_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_0_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/media/darkayserleo/Data/vortx/scripts/train.py", line 60, in <module>
    trainer = pl.Trainer(
  File "/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
    return fn(self, **kwargs)
  File "/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 426, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1525, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 89, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/home/darkayserleo/anaconda3/envs/vortx2/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 151, in _sanitize_gpu_ids
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run glorious-valley-7 at: https://wandb.ai/leonelos/vortx/runs/yd5za7au
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20230705_192250-yd5za7au/logs

I also tried PyTorch 1.4, and it throws a lot of incompatibility errors...

The instructions in the README install the newest version of PyTorch (2.0), and perhaps that is why pytorch_lightning is not working.
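
A quick way to check that suspicion is to ask PyTorch what it was built with. The following is a minimal sketch and not part of the vortx repository; `describe_torch_build` is a hypothetical helper name. It relies on `torch.__version__`, `torch.version.cuda` (which is `None` for CPU-only builds), `torch.cuda.is_available()` and `torch.cuda.device_count()`, and it returns `None` if PyTorch cannot be found in the active environment.

```python
import importlib.util


def describe_torch_build():
    """Return basic facts about the installed PyTorch build, or None if absent."""
    # Hypothetical helper: skip everything if torch is not installed here.
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch_version": torch.__version__,
        # None means this PyTorch package was built without CUDA support.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

If `built_with_cuda` comes back as `None`, the installed package is a CPU-only build, and no driver or CUDA toolkit change on the system side would make it see a GPU.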

Can you help me, please? A few months ago I had to stop working on this because I didn't have the hardware to run the code. Now I have a better hardware setup: I managed to run the ScanNet-to-vortx formatting step and the TSDF build, but I got stuck when I tried to train.

Also, while trying to fix this issue I deleted my old vortx conda env (which I built months ago and in which everything worked). Now when I rerun the previous steps, such as formatting ScanNet to vortx and building the TSDF, they no longer work. It's a complete mess.

Now my vortx env is broken and I can't do anything. I really need a hand with this, please.

My assumption is that some of the dependencies are being installed at their newest versions, while the ones you pin:

pytorch-lightning==1.5
scikit-image==0.18
pip install git+https://github.com/mit-han-lab/torchsparse.git@v1.4.0

stay at those versions, so perhaps there is some kind of incompatibility. Please help!

EDITED: I reinstalled everything in a new conda env called vortx2, then added the following lines to .bashrc:

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export PATH=$CUDA_HOME/bin:$PATH

and generate_gt is working as expected.

But still

python scripts/train.py --config config.yml

is not working

DevLeo1 commented 1 year ago

In addition, here is the conda environment export from my newest install:

name: vortx2
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=2_kmp_llvm
  - blas=2.116=mkl
  - blas-devel=3.9.0=16_linux64_mkl
  - brotli-python=1.0.9=py39h5a03fae_9
  - bzip2=1.0.8=h7f98852_4
  - ca-certificates=2023.5.7=hbcca054_0
  - certifi=2023.5.7=pyhd8ed1ab_0
  - charset-normalizer=3.1.0=pyhd8ed1ab_0
  - cudatoolkit=11.3.1=h9edb442_11
  - ffmpeg=4.3=hf484d3e_0
  - filelock=3.12.2=pyhd8ed1ab_0
  - freetype=2.12.1=hca18f0e_1
  - gmp=6.2.1=h58526e2_0
  - gmpy2=2.1.2=py39h376b7d2_1
  - gnutls=3.6.13=h85f3911_1
  - icu=72.1=hcb278e6_0
  - idna=3.4=pyhd8ed1ab_0
  - jinja2=3.1.2=pyhd8ed1ab_1
  - jpeg=9e=h0b41bf4_3
  - lame=3.100=h166bdaf_1003
  - lcms2=2.15=hfd0df8a_0
  - ld_impl_linux-64=2.40=h41732ed_0
  - lerc=4.0.0=h27087fc_0
  - libblas=3.9.0=16_linux64_mkl
  - libcblas=3.9.0=16_linux64_mkl
  - libdeflate=1.17=h0b41bf4_0
  - libffi=3.4.2=h7f98852_5
  - libgcc-ng=13.1.0=he5830b7_0
  - libgfortran-ng=13.1.0=h69a702a_0
  - libgfortran5=13.1.0=h15d22d2_0
  - libgomp=13.1.0=he5830b7_0
  - libhwloc=2.9.1=nocuda_h7313eea_6
  - libiconv=1.17=h166bdaf_0
  - liblapack=3.9.0=16_linux64_mkl
  - liblapacke=3.9.0=16_linux64_mkl
  - libnsl=2.0.0=h7f98852_0
  - libpng=1.6.39=h753d276_0
  - libsqlite=3.42.0=h2797004_0
  - libstdcxx-ng=13.1.0=hfd8a6a1_0
  - libtiff=4.5.0=h6adf6a1_2
  - libuuid=2.38.1=h0b41bf4_0
  - libwebp-base=1.3.1=hd590300_0
  - libxcb=1.13=h7f98852_1004
  - libxml2=2.11.4=h0d562d8_0
  - libzlib=1.2.13=hd590300_5
  - llvm-openmp=16.0.6=h4dfa4b3_0
  - markupsafe=2.1.3=py39hd1e30aa_0
  - mkl=2022.1.0=h84fe81f_915
  - mkl-devel=2022.1.0=ha770c72_916
  - mkl-include=2022.1.0=h84fe81f_915
  - mpc=1.3.1=hfe3b2da_0
  - mpfr=4.2.0=hb012696_0
  - mpmath=1.3.0=pyhd8ed1ab_0
  - ncurses=6.4=hcb278e6_0
  - nettle=3.6=he412f7d_0
  - networkx=3.1=pyhd8ed1ab_0
  - openh264=2.1.1=h780b84a_0
  - openjpeg=2.5.0=hfec8fc6_2
  - openssl=3.1.1=hd590300_1
  - pillow=9.4.0=py39h2320bf1_1
  - pip=23.1.2=pyhd8ed1ab_0
  - pthread-stubs=0.4=h36c2ea0_1001
  - pysocks=1.7.1=pyha2e5f31_6
  - python=3.9.16=h2782a2a_0_cpython
  - python_abi=3.9=3_cp39
  - pytorch=2.0.1=py3.9_cpu_0
  - pytorch-mutex=1.0=cpu
  - readline=8.2=h8228510_1
  - requests=2.31.0=pyhd8ed1ab_0
  - setuptools=68.0.0=pyhd8ed1ab_0
  - sympy=1.12=pypyh9d50eac_103
  - tbb=2021.9.0=hf52228f_0
  - tk=8.6.12=h27826a3_0
  - torchvision=0.15.2=py39_cpu
  - typing_extensions=4.7.1=pyha770c72_0
  - wheel=0.40.0=pyhd8ed1ab_0
  - xorg-libxau=1.0.11=hd590300_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.6=h166bdaf_0
  - zlib=1.2.13=hd590300_5
  - zstd=1.5.2=h3eb15da_6
  - pip:
    - absl-py==1.4.0
    - addict==2.4.0
    - aiohttp==3.8.4
    - aiosignal==1.3.1
    - ansi2html==1.8.0
    - appdirs==1.4.4
    - asttokens==2.2.1
    - async-timeout==4.0.2
    - attrs==23.1.0
    - backcall==0.2.0
    - black==23.3.0
    - cachetools==5.3.1
    - click==8.1.3
    - comm==0.1.3
    - configargparse==1.5.5
    - contourpy==1.1.0
    - cycler==0.11.0
    - dash==2.11.1
    - dash-core-components==2.0.0
    - dash-html-components==2.0.0
    - dash-table==5.0.0
    - debugpy==1.6.7
    - decorator==5.1.1
    - docker-pycreds==0.4.0
    - executing==1.2.0
    - fastjsonschema==2.17.1
    - flask==2.2.5
    - fonttools==4.40.0
    - freetype-py==2.4.0
    - frozenlist==1.3.3
    - fsspec==2023.6.0
    - future==0.18.3
    - gitdb==4.0.10
    - gitpython==3.1.31
    - google-auth==2.21.0
    - google-auth-oauthlib==1.0.0
    - grpcio==1.51.3
    - imageio==2.31.1
    - importlib-metadata==6.7.0
    - importlib-resources==5.12.0
    - ipykernel==6.24.0
    - ipython==8.14.0
    - ipywidgets==8.0.7
    - itsdangerous==2.1.2
    - jedi==0.18.2
    - joblib==1.3.1
    - jsonschema==4.17.3
    - jupyter-client==8.3.0
    - jupyter-core==5.3.1
    - jupyterlab-widgets==3.0.8
    - kiwisolver==1.4.4
    - lightning-utilities==0.9.0
    - llvmlite==0.40.1
    - mako==1.2.4
    - markdown==3.4.3
    - matplotlib==3.7.2
    - matplotlib-inline==0.1.6
    - msgpack==1.0.5
    - multidict==6.0.4
    - mypy-extensions==1.0.0
    - nbformat==5.7.0
    - nest-asyncio==1.5.6
    - numba==0.57.1
    - numpy==1.24.4
    - oauthlib==3.2.2
    - open3d==0.17.0
    - opencv-python==4.8.0.74
    - packaging==23.1
    - pandas==2.0.3
    - parso==0.8.3
    - pathspec==0.11.1
    - pathtools==0.1.2
    - pexpect==4.8.0
    - pickleshare==0.7.5
    - platformdirs==3.8.0
    - plotly==5.15.0
    - prompt-toolkit==3.0.39
    - protobuf==4.23.3
    - psutil==5.9.5
    - ptyprocess==0.7.0
    - pure-eval==0.2.2
    - pyasn1==0.5.0
    - pyasn1-modules==0.3.0
    - pycuda==2022.2.2
    - pydeprecate==0.3.1
    - pyglet==2.0.8
    - pygments==2.15.1
    - pyopengl==3.1.0
    - pyparsing==3.0.9
    - pyquaternion==0.9.9
    - pyrender==0.1.45
    - pyrsistent==0.19.3
    - python-dateutil==2.8.2
    - pytools==2023.1
    - pytorch-lightning==1.5.0
    - pytz==2023.3
    - pywavelets==1.4.1
    - pyyaml==6.0
    - pyzmq==25.1.0
    - ray==2.5.1
    - requests-oauthlib==1.3.1
    - retrying==1.3.4
    - rsa==4.9
    - scikit-image==0.18.0
    - scikit-learn==1.3.0
    - scipy==1.11.1
    - sentry-sdk==1.27.0
    - setproctitle==1.3.2
    - six==1.16.0
    - smmap==5.0.0
    - stack-data==0.6.2
    - tenacity==8.2.2
    - tensorboard==2.13.0
    - tensorboard-data-server==0.7.1
    - threadpoolctl==3.1.0
    - tifffile==2023.7.4
    - tomli==2.0.1
    - torchmetrics==1.0.0
    - torchsparse==1.4.0
    - tornado==6.3.2
    - tqdm==4.65.0
    - traitlets==5.9.0
    - trimesh==3.22.3
    - tzdata==2023.3
    - urllib3==1.26.16
    - wandb==0.15.5
    - wcwidth==0.2.6
    - werkzeug==2.2.3
    - widgetsnbextension==4.0.8
    - yarl==1.9.2
    - zipp==3.15.0
prefix: /home/darkayserleo/anaconda3/envs/vortx2

Can you share your requirements.yml with me?

Using the same one as yours might help me solve the issue.
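
In the environment listing above, the `pytorch`, `pytorch-mutex` and `torchvision` entries all carry `cpu` in their build strings, which would be consistent with the empty GPU list in the error. Below is a minimal sketch for spotting such entries in a `conda env export` dump. `find_cpu_builds` is a hypothetical helper, and treating a `cpu` token in the build string as the sign of a CPU-only package is an assumption about the pytorch channel's naming, not a documented guarantee.

```python
def find_cpu_builds(export_text):
    """Return (name, version, build) for conda entries whose build mentions 'cpu'."""
    hits = []
    for raw in export_text.splitlines():
        line = raw.strip()
        # Only list items matter; pip pins use '==' and carry no build string.
        if not line.startswith("- ") or "==" in line:
            continue
        parts = line[2:].split("=")
        if len(parts) != 3:  # channels, 'pip:' and similar entries
            continue
        name, version, build = parts
        # Assumption: a 'cpu' token in the build string marks a CPU-only build.
        if "cpu" in build.split("_"):
            hits.append((name, version, build))
    return hits


# A few lines copied from the export above.
SAMPLE = """\
dependencies:
  - cudatoolkit=11.3.1=h9edb442_11
  - pytorch=2.0.1=py3.9_cpu_0
  - pytorch-mutex=1.0=cpu
  - torchvision=0.15.2=py39_cpu
  - pip:
    - pytorch-lightning==1.5.0
"""

if __name__ == "__main__":
    for name, version, build in find_cpu_builds(SAMPLE):
        print(f"{name} {version} ({build})")
```

Note that `cudatoolkit=11.3.1` is present in the same environment, so having the toolkit package installed does not by itself imply that the PyTorch package next to it is a CUDA build.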

DevLeo1 commented 1 year ago

No luck. I broke my Linux system trying to remove and reinstall the drivers, so I installed Ubuntu 22 from scratch.

Then I did the following steps:

  1. sudo apt-get install nvidia-driver-535
  2. sudo apt-get install git
  3. git clone https://github.com/noahstier/vortx.git
  4. wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
  5. bash ~/miniconda.sh -b
  6. rm ~/miniconda.sh
  7. conda create -n vortx python=3.9 -y
  8. conda activate vortx
  9. conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
  10. sudo apt-get install nvidia-cuda-toolkit
  11. pip install pytorch-lightning==1.5 scikit-image==0.18 numba pillow wandb tqdm open3d pyrender ray trimesh pyyaml matplotlib black pycuda opencv-python imageio
  12. sudo apt install libsparsehash-dev
  13. pip install git+https://github.com/mit-han-lab/torchsparse.git@v1.4.0
  14. pip install -e .

I installed the latest NVIDIA driver, and when I run:

python scripts/train.py --config config.yml

I get:

Global seed set to 0
wandb: Currently logged in as: ********* Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/darkayserleo/vortx/wandb/run-20230706_145359-5w0pfpzf
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run rare-gorge-24
wandb: ⭐️ View project at https://wandb.ai/leonelos/vortx
wandb: 🚀 View run at https://wandb.ai/leonelos/vortx/runs/5w0pfpzf
/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_0_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_0_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/home/darkayserleo/vortx/scripts/train.py", line 58, in <module>
    trainer = pl.Trainer(
  File "/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 38, in insert_env_defaults
    return fn(self, **kwargs)
  File "/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 426, in __init__
    gpu_ids, tpu_cores = self._parse_devices(gpus, auto_select_gpus, tpu_cores)
  File "/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1525, in _parse_devices
    gpu_ids = device_parser.parse_gpu_ids(gpus)
  File "/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 89, in parse_gpu_ids
    return _sanitize_gpu_ids(gpus)
  File "/home/darkayserleo/miniconda3/envs/vortx/lib/python3.9/site-packages/pytorch_lightning/utilities/device_parser.py", line 151, in _sanitize_gpu_ids
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: 🚀 View run rare-gorge-24 at: https://wandb.ai/leonelos/vortx/runs/5w0pfpzf
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20230706_145359-5w0pfpzf/logs

It still says that no GPUs are detected:

pytorch_lightning.utilities.exceptions.MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []
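
To make the message easier to read: the requested GPU indices are compared against the devices that PyTorch can see, and index 0 is rejected because that list is empty. The sketch below is a simplified illustration of that kind of check, not the actual pytorch_lightning source; `check_requested_gpus` is a hypothetical name, and it raises a plain `ValueError` rather than Lightning's `MisconfigurationException`.

```python
def check_requested_gpus(requested, available):
    """Return the requested GPU ids, or raise if any of them is not available."""
    # Illustrative only: mimics the shape of the error in the traceback above.
    missing = [gpu for gpu in requested if gpu not in available]
    if missing:
        raise ValueError(
            f"You requested GPUs: {requested}\n But your machine only has: {available}"
        )
    return requested


if __name__ == "__main__":
    print(check_requested_gpus([0], [0, 1]))  # both GPUs visible: accepted
    try:
        check_requested_gpus([0], [])  # no GPU visible to the process
    except ValueError as err:
        print(err)
```

An empty available list means the Python process sees no CUDA device at all, which is a separate question from whether the driver and `nvidia-smi` see the cards.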

nvidia-smi

Thu Jul  6 14:54:27 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:26:00.0 Off |                  N/A |
|  0%   53C    P8              18W / 170W |      9MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off | 00000000:27:00.0  On |                  N/A |
| 54%   53C    P8              21W / 170W |    502MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1903      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      1903      G   /usr/lib/xorg/Xorg                          170MiB |
|    1   N/A  N/A      2247      G   /usr/bin/gnome-shell                        138MiB |
|    1   N/A  N/A      5316      G   ...irefox/2356/usr/lib/firefox/firefox      180MiB |
+---------------------------------------------------------------------------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
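
The outputs above report different CUDA numbers: `nvidia-smi` shows 12.2, which is the highest CUDA version the driver supports, while `nvcc` from the apt package reports release 11.5, and the conda command earlier asked for `cudatoolkit=11.3`. As a minimal sketch for collecting the first two values from pasted output (the function names are hypothetical and the regular expressions assume the output formats shown in this post):

```python
import re


def parse_nvidia_smi_cuda(text):
    """Extract the 'CUDA Version' field from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return match.group(1) if match else None


def parse_nvcc_release(text):
    """Extract the release number from `nvcc --version` output, or None."""
    match = re.search(r"release\s+([\d.]+)", text)
    return match.group(1) if match else None


# Lines copied from the outputs in this post.
SMI_LINE = "| NVIDIA-SMI 535.54.03    Driver Version: 535.54.03    CUDA Version: 12.2     |"
NVCC_LINE = "Cuda compilation tools, release 11.5, V11.5.119"

if __name__ == "__main__":
    print("driver supports up to CUDA", parse_nvidia_smi_cuda(SMI_LINE))
    print("system nvcc release", parse_nvcc_release(NVCC_LINE))
```

Neither number says which CUDA version, if any, the installed PyTorch package was built against. That has to be read from PyTorch itself or from its conda build string.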

What do I need to do to run your code, please?

I'm feeling frustrated. I don't know what else to try.