Closed KevvinHoo closed 4 years ago
Hi,
Thanks for your interest. Currently, Horovod 0.19.2 is not supported in our framework. Please follow the instructions to install the environment. Let us know if you encounter any issues.
Hang
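To check which Horovod version is actually installed before going further, a minimal sketch like the following may help. It assumes only that the installed `horovod` package exposes `__version__`; the helper names `installed_version` and `is_supported` are made up for this illustration, and 0.18.2 is the version discussed in this thread.

```python
import importlib

SUPPORTED = "0.18.2"  # the Horovod version discussed in this thread


def installed_version(module_name="horovod"):
    """Return the module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def is_supported(version, supported=SUPPORTED):
    """True only when the installed version string matches exactly."""
    return version == supported


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("Horovod is not importable in this environment")
    elif is_supported(version):
        print("Horovod", version, "matches the supported version")
    else:
        print("Horovod", version, "found; this thread targets", SUPPORTED)
```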
I have installed Horovod 0.18.2. When I run the example program named pytorch_mnist.py, I get the following error:
Traceback (most recent call last):
  File "./horovod_mnist.py", line 189, in <module>
    train(epoch)
  File "./horovod_mnist.py", line 67, in train
    loss.backward()
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 141, in hook
    handle, ctx = self._allreduce_grad_async(p)
  File "/GPUFS/nudt_chkwu_2/.conda/envs/torch_ho/lib/python3.6/site-packages/horovod/torch/__init__.py", line 122, in _allreduce_grad_async
    tensor_compressed, ctx = self._compression.compress(tensor)
AttributeError: 'Allgather' object has no attribute 'compress'
I see that the class Allgather has an attribute 'compress'. The example program uses the TopKCompressor. I don't know how to solve this problem. Could you give me some advice? Thanks for your help.
It seems like you haven't applied the Horovod patch we provided. Before applying the patch, please make sure you can run an official Horovod example following the Horovod guidelines.
We did find a bug in pytorch_mnist.py, see here. Please update your training script to this version.
Hang
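The traceback above shows the stock optimizer calling `self._compression.compress(tensor)` directly. The sketch below reproduces that failure mode with stand-in classes only. None of these classes are the real Horovod or framework code; their names and bodies are hypothetical. It illustrates why an object that wraps a compressor, but has no `compress` method of its own, triggers the `AttributeError` when the optimizer has not been patched.

```python
class TopKStandIn:
    """Hypothetical compressor: exposes compress(), like a Horovod Compression."""

    def compress(self, tensor):
        # A real compressor would sparsify; here the tensor passes through.
        return tensor, None


class CommunicatorStandIn:
    """Hypothetical wrapper: holds a compressor but defines no compress()."""

    def __init__(self, compressor):
        self.compressor = compressor


class UnpatchedOptimizerStandIn:
    """Mimics only the call seen in the traceback, nothing else."""

    def __init__(self, compression):
        self._compression = compression

    def allreduce_grad(self, tensor):
        # Same shape as the failing line: the object itself must have compress().
        tensor_compressed, ctx = self._compression.compress(tensor)
        return tensor_compressed, ctx


if __name__ == "__main__":
    works = UnpatchedOptimizerStandIn(TopKStandIn())
    print(works.allreduce_grad([1.0, 2.0]))

    fails = UnpatchedOptimizerStandIn(CommunicatorStandIn(TopKStandIn()))
    try:
        fails.allreduce_grad([1.0, 2.0])
    except AttributeError as exc:
        print("AttributeError:", exc)
```

Under this reading, the patch is what teaches the optimizer to handle the wrapper object, which would explain why the script fails on an unpatched install.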
There is no problem when I use compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none, which is used in the example program provided by Horovod, instead of grc = Allgather(TopKCompressor(0.3), ResidualMemory(), hvd.size()). The bug you mentioned above had already been fixed before I ran the training script.
It is true that I haven't applied the patch, as I don't know how to do it. Could you tell me the details of this patch?
Best regards
Never mind - I have read the file named horovod 0.18.2-patch. Maybe I know how to apply this patch. The - indicates a line to delete and the + a line to add, right? And do I need to change the TensorFlow module files when I use the PyTorch framework?
No need. Just modify the related pytorch files.
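For reference, the -/+ convention asked about above is the usual unified diff format: lines starting with - are removed from the old file and lines starting with + are added in the new one. The sketch below uses Python's standard `difflib` on made-up lines (not the contents of the actual patch) to show how such markers are read. The function name `describe_diff` and the file labels are chosen only for this illustration.

```python
import difflib


def describe_diff(old_lines, new_lines):
    """Return (removed, added) lines from a unified diff of two line lists."""
    diff = list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="a/example.py", tofile="b/example.py",  # illustrative labels
        lineterm="",
    ))
    removed, added = [], []
    # The first two entries are the '---' and '+++' file headers; skip them.
    for line in diff[2:]:
        if line.startswith("@@"):
            continue  # hunk header giving line ranges
        if line.startswith("-"):
            removed.append(line[1:])  # line to delete from the old file
        elif line.startswith("+"):
            added.append(line[1:])  # line to add in the new file
    return removed, added


if __name__ == "__main__":
    old = ["a = 1", "b = 2", "c = 3"]
    new = ["a = 1", "b = 20", "c = 3"]
    removed, added = describe_diff(old, new)
    print("removed:", removed)
    print("added:", added)
```

In practice a patch file is usually applied with a tool such as `patch` or `git apply` rather than by hand; the exact invocation depends on where the patch file and the installed Horovod sources are located.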
The program breaks off at the line above. The error information is as below:
Can you suggest a solution? I appreciate your help!