Open pepper-jk opened 2 years ago
I figured it out.
I had to init the submodule https://github.com/synxlin/mini-torchpack.
$ git submodule init
$ git submodule update
I also installed the additional requirement torchvision>=0.4 for the submodule and the missing requirement six.
$ pip install torchvision
$ pip install six
However, it is still not running.
There is some issue with horovod. It seems like openmpi needs to be installed first. I uninstalled horovod and reinstalled it with the suggested parameters below, but it still produces the same error.
I also ran into a pytorch version mix-up. From what I can tell, the submodule requires a CUDA version. I'm on a machine without GPUs, though, so this might become a problem later.
I'll keep at it though and post my updates here.
python train.py --devices cpu
Extension horovod.torch has not been built: /home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
Traceback (most recent call last):
File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 33, in <module>
from horovod.torch import mpi_lib_v2 as mpi_lib
ImportError: cannot import name 'mpi_lib_v2' from 'horovod.torch' (/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/__init__.py)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 17, in <module>
from dgc.horovod.optimizer import DistributedOptimizer
File "/home/pepper-jk/code/deep-gradient-compression/dgc/horovod/__init__.py", line 2, in <module>
from dgc.horovod.optimizer import DistributedOptimizer
File "/home/pepper-jk/code/deep-gradient-compression/dgc/horovod/optimizer.py", line 24, in <module>
from horovod.torch.mpi_ops import allreduce_async_
File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 35, in <module>
check_installed_version('pytorch', torch.__version__, e)
File "/home/pepper-jk/.conda/envs/deep_comp/lib/python3.7/site-packages/horovod/common/util.py", line 260, in check_installed_version
raise HorovodVersionMismatchError(name, version, installed_version) from exception
horovod.common.exceptions.HorovodVersionMismatchError: Framework pytorch installed with version None but found version 1.10.0+cu102.
This can result in unexpected behavior including runtime errors.
Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.
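To narrow down errors like the one above, a small check can report which pieces of the environment are visible at all. This is only a sketch: the function name check_environment and the default module and executable lists are my own choices, not something from the repository. It uses importlib.util.find_spec, which locates a module without importing it, so a broken Horovod build cannot crash the check itself. Note that finding the horovod package does not prove its PyTorch extension was built; it only rules out a missing install.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "torchvision", "six", "horovod"),
                      executables=("mpirun",)):
    """Report which modules are importable and which executables are on PATH.

    The default names are assumptions based on the packages mentioned
    in this thread; adjust them as needed.
    """
    report = {}
    for name in modules:
        # find_spec locates a top-level module without executing it.
        report[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        report[exe] = shutil.which(exe) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If mpirun is reported as missing, that would be consistent with the "MPI libs are missing" warning; if everything is found, the problem is more likely in how Horovod was built against the installed PyTorch.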
Hello,
I wanted to try out your code and came across an issue regarding pytorch dependencies.
I installed all the requirements via your requirements.txt in a fresh conda environment with python 3.7.11. I made sure the versions are at least the ones listed in the readme.
I installed openmpi via:
conda install openmpi
However, it appears the module torchpack.mtpack cannot be found.
I also tried to go back from torch==1.9.1 to torch==1.5, but no change.
Hope you can help me. Thanks in advance.
p.s. I will try this again tomorrow and update this issue if I find a solution.