sands-lab / grace

GRACE - GRAdient ComprEssion for distributed deep learning
https://sands.kaust.edu.sa/project/grace/
BSD 2-Clause "Simplified" License
133 stars, 45 forks

ImportError:Extension horovod.torch has not been built #1

Closed KevvinHoo closed 4 years ago

KevvinHoo commented 4 years ago
from horovod.torch import allreduce_async_, synchronize

The program fails at the line above. The error output is below:

Traceback (most recent call last):
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 32, in <module>
    __file__, 'mpi_lib_v2')
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/torch/__init__.py", line 35, in <module>
    __file__, 'mpi_lib', '_mpi_lib')
  File "/GPUFS/nudt_chkwu_2/kfhu/horovod-0.19.2/horovod/common/util.py", line 56, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.torch has not been built.  If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.

Can you suggest a fix? I appreciate your help!

hangxu0304 commented 4 years ago

Hi,

Thanks for your interest. Horovod 0.19.2 is currently not supported by our framework. Please follow the installation instructions to set up the environment. Let us know if you encounter any issues.

Hang
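For readers hitting the same ImportError, a quick environment check can confirm which Horovod version is installed and whether the PyTorch extension was built. The sketch below is a minimal example, not part of GRACE or Horovod: the function name check_horovod_torch and the expected version prefix "0.18." are assumptions based on the version discussed in this thread.

```python
import importlib
import importlib.util


def check_horovod_torch(expected_prefix="0.18."):
    """Report Horovod's install state without raising.

    expected_prefix is an assumption taken from this thread (0.18.x);
    adjust it to whatever version your setup instructions name.
    """
    report = {
        "installed": False,
        "version": None,
        "version_ok": False,
        "torch_extension": False,
        "error": None,
    }
    if importlib.util.find_spec("horovod") is None:
        report["error"] = "horovod is not installed"
        return report
    report["installed"] = True
    try:
        import horovod

        version = getattr(horovod, "__version__", None)
        report["version"] = version
        report["version_ok"] = bool(version) and version.startswith(expected_prefix)
        # Importing horovod.torch raises ImportError when the extension
        # was not built (the error reported at the top of this thread).
        importlib.import_module("horovod.torch")
        report["torch_extension"] = True
    except Exception as exc:
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    for key, value in check_horovod_torch().items():
        print(f"{key}: {value}")
```

If torch_extension comes back False, the error message itself suggests reinstalling Horovod with HOROVOD_WITH_PYTORCH=1 set so that the build error becomes visible.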

KevvinHoo commented 4 years ago

I have installed Horovod 0.18.2. When I run the example program named pytorch_mnist.py, I get the following error:

Traceback (most recent call last):
  File "./horovod_mnist.py", line 189, in <module>
    train(epoch)
  File "./horovod_mnist.py", line 67, in train
    loss.backward()
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 141, in hook
    handle, ctx = self._allreduce_grad_async(p)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 122, in _allreduce_grad_async
    tensor_compressed, ctx = self._compression.compress(tensor)
AttributeError: 'Allgather' object has no attribute 'compress'

I found that the class Allgather has an attribute 'compress'. The example program uses the TopKCompressor. I don't know how to solve this problem. Could you give me some advice? Thanks for your help.

hangxu0304 commented 4 years ago

It seems you haven't applied the Horovod patch we provided. Before applying the patch, please make sure you can run an official Horovod example following the Horovod guidelines.
We did find a bug in pytorch_mnist.py (see here). Please update your training script to this version.

Hang
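The AttributeError in the traceback can be reproduced without Horovod. The sketch below assumes, based only on the failing line `self._compression.compress(tensor)`, that unpatched Horovod calls compress() and decompress() on whatever object it receives as the compression argument. All class and function names here are hypothetical stand-ins, not GRACE or Horovod code.

```python
class NoneCompression:
    """Stand-in for a Horovod-style compression object (hypothetical)."""

    @staticmethod
    def compress(tensor):
        # Return the tensor unchanged plus a context for decompress().
        return tensor, None

    @staticmethod
    def decompress(tensor, ctx):
        return tensor


class FakeCommunicator:
    """Stand-in for a communicator object such as Allgather (hypothetical).

    It wraps a compressor but exposes no compress() method itself.
    """

    def __init__(self, compressor):
        self.compressor = compressor


def allreduce_grad(compression, tensor):
    # Mirrors the call pattern of the failing line in the traceback.
    tensor_compressed, ctx = compression.compress(tensor)
    return compression.decompress(tensor_compressed, ctx)


def supports_compress(obj):
    """True if obj offers the compress/decompress pair used above."""
    return all(callable(getattr(obj, name, None))
               for name in ("compress", "decompress"))


if __name__ == "__main__":
    print(allreduce_grad(NoneCompression, [1.0, 2.0]))
    grc = FakeCommunicator(NoneCompression)
    print(supports_compress(grc))
    try:
        allreduce_grad(grc, [1.0, 2.0])
    except AttributeError as exc:
        print("AttributeError:", exc)
```

A check like supports_compress() shows why the stock optimizer rejects the communicator object: it was written for a different interface, which is presumably what the patch changes.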

KevvinHoo commented 4 years ago

There is no problem when I use compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none, as in the example program shipped with Horovod, instead of grc = Allgather(TopKCompressor(0.3), ResidualMemory(), hvd.size()). I had already fixed the bug you mentioned before running the training script. It is true that I haven't applied the patch, because I don't know how to do it. Could you tell me the details about this patch?

Best regards

KevvinHoo commented 4 years ago

Never mind - I have read the file named horovod 0.18.2-patch, and I think I know how to apply it. The - lines indicate deletions and the + lines indicate additions, right? And do I need to change the TensorFlow module files when I use the PyTorch framework?

hangxu0304 commented 4 years ago

No need. Just modify the related PyTorch files.
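For reference, the - / + convention asked about above is the unified diff format. The sketch below uses Python's standard difflib module on made-up file contents (the lines and the file name example.py are hypothetical, not taken from the GRACE patch) to show how the markers map to deletions and additions.

```python
import difflib

# Hypothetical file contents before and after a change.
before = ["alpha\n", "beta\n", "gamma\n"]
after = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]

diff = list(difflib.unified_diff(
    before, after, fromfile="a/example.py", tofile="b/example.py"))


def rebuild_from_diff(diff_lines):
    """Rebuild the patched text from a unified diff.

    Only valid when the diff's context covers the whole file, as in
    this small example. Real patches should be applied with a tool.
    """
    result = []
    for line in diff_lines[2:]:        # skip the '---' / '+++' file headers
        if line.startswith("@@"):      # hunk header carrying line ranges
            continue
        marker, text = line[0], line[1:]
        if marker in (" ", "+"):       # context line or added line: keep
            result.append(text)
        # '-' marks a removed line, so it is dropped
    return result


if __name__ == "__main__":
    print("".join(diff), end="")
    print(rebuild_from_diff(diff) == after)
```

In practice a patch file is usually applied with the patch utility or git apply rather than by hand, which avoids copy errors when the patch touches many lines.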

On May 28, 2020, at 16:12, Tonyhukaifan notifications@github.com wrote:

 Never mind - I have read the file named horovod 0.18.2-patch, and I think I know how to apply it. The - lines indicate deletions and the + lines indicate additions, right? And do I need to change the TensorFlow module files when I use the PyTorch framework?
