Are you using PyTorch with CUDA enabled?
As far as I know, yes. I am running this within the Docker container and haven't changed PyTorch from whatever was already in it. When I ran the image_classification experiments, they were definitely running on the GPU, which I assume confirms that the PyTorch-with-CUDA configuration is at the very least installed and available.
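For what it's worth, here is a minimal check I can run inside the container to confirm, using only the standard torch.cuda API (nothing PipeDream-specific):

import torch

# Quick sanity check that this PyTorch build can see and use the GPU.
print(torch.__version__, torch.version.cuda)  # library version and the CUDA toolkit it was built against
print(torch.cuda.is_available())              # True if a CUDA device is visible
print(torch.cuda.get_device_name(0))          # e.g. "GeForce GTX 1080 Ti"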
Interesting -- I will dig a bit more tomorrow to try to reproduce this.
One more thing to try: can you install seq2seq in the runtime directory, and try running an already partitioned GNMT model?
Good suggestion! I just tried it, but I unfortunately got the same error:
# python main_with_runtime.py --data_dir /mnt/wmt16 --distributed_backend gloo -m models.gnmt.gpus=4 --epochs 1 -b 128 -v 1 --master_addr 127.0.0.1 --config_path models/gnmt/gpus=4/hybrid_conf.json --rank 0 --local_rank 0
THCudaCheck FAIL file=/opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=209 : no kernel image is available for execution on the device
Traceback (most recent call last):
File "main_with_runtime.py", line 580, in <module>
main()
File "main_with_runtime.py", line 177, in main
output_tensors = stage(*tuple(input_tensors))
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/runtime/translation/models/gnmt/gpus=4/stage0.py", line 20, in forward
out5 = self.layer5(out4, out1)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/pipedream/runtime/translation/seq2seq/models/encoder.py", line 64, in forward
return self.emu_bidir_lstm(self.layer1, self.layer2, input, lengths)
File "/workspace/pipedream/runtime/translation/seq2seq/models/encoder.py", line 53, in emu_bidir_lstm
out1 = model1(inputl1)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 507, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 556, in forward
return self.forward_tensor(input, hx)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 536, in forward_tensor
output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 509, in forward_impl
dtype=input.dtype, device=input.device)
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /opt/pytorch/pytorch/aten/src/THC/generic/THCTensorMath.cu:35
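If this really is a build/architecture mismatch, the same error should presumably be reproducible outside PipeDream; a bare CUDA op like the following sketch, run in the same container, would likely fail the same way:

import torch

# If "no kernel image is available" comes from an architecture mismatch between
# the PyTorch/CUDA build and the GPU, any kernel launch should trigger it,
# not just the GNMT runtime.
x = torch.randn(4, 4, device="cuda")
print((x @ x).sum())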
I'll be trying some things out and digging this afternoon. I'll report back if I find anything.
My best guess is that the problem is hardware specific. We are running on a 4-GPU machine with only GeForce GTX 1080 Ti GPUs, and after some research I am fairly sure the GPU architecture list targeted in the setup.py script (it seems to require Volta support) goes beyond what our GPUs can do. Presumably we just cannot run the GNMT experiment on the hardware we have. If this doesn't sound right to you, feel free to keep poking at it; otherwise, I think we can close this one.
What's the highest compute capability that your GPU supports?
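If it helps, you should be able to query it directly with PyTorch's standard torch.cuda API:

import torch

# Reports the (major, minor) compute capability of device 0,
# e.g. (6, 1) for Pascal cards like the GTX 1080 Ti, (7, 0) for Volta.
print(torch.cuda.get_device_capability(0))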
I believe it supports up to sm_61; as I understand it, the main piece missing relative to sm_70 is Tensor Core support.
Hmm, it seems like this won't work then unfortunately.
Sorry for the delay, and thanks for confirming. I'll close this out.
I am running the translation profiler as follows:
When running, I encounter the following error when the first training epoch starts:
Do you have any ideas what might be causing this? It isn't clear to me whether this is a software/environment issue or an issue with my specific hardware.