synxlin / deep-gradient-compression

[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
https://arxiv.org/pdf/1712.01887.pdf
Apache License 2.0

Dependency Issues #4

Open pepper-jk opened 2 years ago

pepper-jk commented 2 years ago

Hello,

I wanted to try out your code and came across an issue regarding pytorch dependencies.

I installed all the requirements in a fresh conda environment with python 3.7.11 via your requirements.txt. I made sure the versions are at least the ones listed in the readme.

I installed openmpi via: conda install openmpi

However, it appears the module torchpack.mtpack cannot be found.

I also tried going back from torch==1.9.1 to torch==1.5, but that made no difference.

Hope you can help me. Thanks in advance.

$ python train.py
Extension horovod.torch has not been built: /home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
Traceback (most recent call last):
  File "train.py", line 15, in <module>
    from torchpack.mtpack.utils.config import Config, configs
ModuleNotFoundError: No module named 'torchpack.mtpack'

P.S. I will try this again tomorrow and update this issue if I find a solution.

pepper-jk commented 2 years ago

I figured it out.

I had to init the submodule https://github.com/synxlin/mini-torchpack.

$ git submodule init
$ git submodule update
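
To spot this situation before running train.py, the check can be sketched in Python. This is a minimal sketch, not part of the repository: the function name `uninitialized_submodules` is hypothetical, and it assumes that a submodule which was never checked out shows up as a missing or empty directory at the path listed in `.gitmodules`.

```python
import re
from pathlib import Path


def uninitialized_submodules(repo_root):
    """Return submodule paths from .gitmodules whose directories are missing or empty."""
    root = Path(repo_root)
    gitmodules = root / ".gitmodules"
    if not gitmodules.is_file():
        return []  # no submodules declared in this checkout
    # Each submodule entry contains a line of the form "path = some/dir".
    paths = re.findall(
        r"^\s*path\s*=\s*(.+?)\s*$", gitmodules.read_text(), flags=re.MULTILINE
    )
    missing = []
    for rel in paths:
        target = root / rel
        # Assumption: a submodule that was never checked out is absent or empty.
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in uninitialized_submodules("."):
        print(f"submodule not checked out: {rel}")
```

Run from the repository root, any path it prints would point to a submodule that still needs the two git commands above.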

I also had to install the additional requirement torchvision>=0.4 for the submodule, as well as the missing requirement six.

$ pip install torchvision
$ pip install six
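
A quick way to see which of these modules are still missing is a small preflight script. This is only a sketch: `is_importable` and the `REQUIRED` list are illustrative names, and the module names are simply the ones that came up in this thread. Note that it only checks whether a module can be located, not whether it works once imported.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be located without importing the module itself."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError (an ImportError) when that parent is missing.
        return False


# Module names taken from this thread; adjust the list to your setup.
REQUIRED = ["torchpack.mtpack", "torchvision", "six", "horovod.torch"]

if __name__ == "__main__":
    for name in REQUIRED:
        print(f"{name}: {'ok' if is_importable(name) else 'MISSING'}")
```

Because the parent package is imported for dotted names, checking `horovod.torch` may still print Horovod's own warnings.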

However, it still is not running.

There is some issue with horovod. It seems like openmpi needs to be installed first. I uninstalled horovod and reinstalled it with the parameters suggested in the output below, but it still produces the same error.

I also ran into a pytorch version mix-up. From what I can tell, the submodule requires a cuda version. I'm on a machine without GPUs here though, so this might become a problem later.

I'll keep at it though and post my updates here.

$ python train.py --devices cpu
Extension horovod.torch has not been built: /home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
Traceback (most recent call last):
  File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 33, in <module>
    from horovod.torch import mpi_lib_v2 as mpi_lib
ImportError: cannot import name 'mpi_lib_v2' from 'horovod.torch' (/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/__init__.py)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 17, in <module>
    from dgc.horovod.optimizer import DistributedOptimizer
  File "/home/pepper-jk/code/deep-gradient-compression/dgc/horovod/__init__.py", line 2, in <module>
    from dgc.horovod.optimizer import DistributedOptimizer
  File "/home/pepper-jk/code/deep-gradient-compression/dgc/horovod/optimizer.py", line 24, in <module>
    from horovod.torch.mpi_ops import allreduce_async_
  File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 35, in <module>
    check_installed_version('pytorch', torch.__version__, e)
  File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/common/util.py", line 260, in check_installed_version
    raise HorovodVersionMismatchError(name, version, installed_version) from exception
horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch installed with version None but found version 1.10.0+cu102.
             This can result in unexpected behavior including runtime errors.
             Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.
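
Both messages point at the same thing: the compiled PyTorch extension is absent from the installed horovod package. A minimal sketch for confirming that without importing horovod.torch follows. It assumes the extension files carry `mpi_lib` in their names, as the paths in the output above suggest, and `built_torch_extensions` is a hypothetical helper name, not a Horovod API.

```python
import importlib.util
from pathlib import Path


def built_torch_extensions(horovod_dir):
    """List compiled extension files under <horovod_dir>/torch whose names contain 'mpi_lib'."""
    torch_dir = Path(horovod_dir) / "torch"
    if not torch_dir.is_dir():
        return []
    # Compiled extensions end in .so on Linux/macOS and .pyd on Windows.
    return sorted(
        p for p in torch_dir.rglob("*mpi_lib*")
        if p.is_file() and p.suffix in {".so", ".pyd"}
    )


if __name__ == "__main__":
    spec = importlib.util.find_spec("horovod")
    if spec is None or not spec.submodule_search_locations:
        print("horovod is not installed in this environment")
    else:
        found = built_torch_extensions(list(spec.submodule_search_locations)[0])
        print("built extensions:", [p.name for p in found] or "none (rebuild needed)")
```

If nothing is found, the suggestion in the output itself applies: reinstall Horovod with HOROVOD_WITH_PYTORCH=1 and `pip install --no-cache-dir`, after the intended PyTorch version is already installed. Horovod's `horovodrun --check-build` command should report the same information.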