iffiX / machin

A reinforcement learning library (framework) designed for PyTorch; implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...
MIT License

Can your_first_program be trained on a GPU? #3

Closed notreadyyet closed 4 years ago

notreadyyet commented 4 years ago

I've tried to train your_first_program on a GPU by uncommenting the static_module_wrapper lines like so:

q_net = static_module_wrapper(q_net, "cuda", "cuda")
q_net_t = static_module_wrapper(q_net_t, "cuda", "cuda")

and got an error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

The combination

q_net = static_module_wrapper(q_net, "cuda", "cpu")
q_net_t = static_module_wrapper(q_net_t, "cuda", "cpu")

fails too. The following setting runs, but doesn't use the GPU:

    q_net = static_module_wrapper(q_net, "cpu", "cuda")
    q_net_t = static_module_wrapper(q_net_t, "cpu", "cuda")

What shall I do to train your_first_program on a GPU? Here is my requirements.txt:

absl-py==0.10.0
astor==0.8.1
astunparse==1.6.3
backcall==0.2.0
brotlipy==0.7.0
cachetools==4.1.1
certifi==2020.6.20
cffi==1.13.2
chardet==3.0.4
cloudpickle==1.6.0
colorlog==4.4.0
cryptography @ file:///tmp/build/80754af9/cryptography_1601046817403/work
cycler==0.10.0
decorator==4.4.2
dill==0.3.2
dm-reverb==0.1.0
dm-tree==0.1.5
EasyProcess==0.3
future==0.18.2
gast==0.3.3
gin-config==0.3.0
google-auth==1.22.1
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
GPUtil==1.4.0
graphviz==0.14.2
grpcio==1.33.1
gym==0.17.3
h5py @ file:///tmp/build/80754af9/h5py_1593454119955/work
idna==2.10
imageio==2.9.0
imageio-ffmpeg==0.4.2
importlib-metadata==2.0.0
install==1.3.4
ipython @ file:///tmp/build/80754af9/ipython_1598883837425/work
ipython-genutils==0.2.0
jedi @ file:///tmp/build/80754af9/jedi_1596490743326/work
Keras-Applications @ file:///tmp/build/80754af9/keras-applications_1594366238411/work
Keras-Preprocessing==1.1.2
kiwisolver==1.2.0
machin==0.3.4
Markdown==3.3.2
matplotlib==3.3.2
mkl-fft==1.2.0
mkl-random==1.1.1
mkl-service==2.3.0
moviepy==1.0.3
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
pandas @ file:///tmp/build/80754af9/pandas_1602088128026/work
parso==0.7.0
pexpect @ file:///tmp/build/80754af9/pexpect_1594383317248/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1594384075987/work
Pillow==8.0.1
portpicker==1.3.1
proglog==0.1.9
progressbar==2.5
prompt-toolkit @ file:///tmp/build/80754af9/prompt-toolkit_1602688806899/work
protobuf==3.13.0
psutil==5.7.2
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.19
pyglet==1.5.0
Pygments @ file:///tmp/build/80754af9/pygments_1600458456400/work
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1594392929924/work
pyparsing==2.4.7
PySocks @ file:///tmp/build/80754af9/pysocks_1594394576006/work
python-dateutil==2.8.1
pytz==2020.1
PyVirtualDisplay @ file:///home/conda/feedstock_root/build_artifacts/pyvirtualdisplay_1602367622068/work
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.5.3
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorboardX==2.1
tensorflow==2.3.1
tensorflow-estimator==2.3.0
tensorflow-probability==0.11.1
termcolor==1.1.0
torch==1.6.0
torchvision==0.5.0
torchviz==0.0.1
tqdm==4.50.2
traitlets @ file:///tmp/build/80754af9/traitlets_1602787416690/work
urllib3==1.25.11
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.3.1
iffiX commented 4 years ago

Hello, the easiest solution is:

q_net = QNet(observe_dim, action_num).to("cuda:0")
q_net_t = QNet(observe_dim, action_num).to("cuda:0")
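
For context, a minimal sketch of how this fits into the tutorial's setup (QNet, observe_dim and action_num are the names from your_first_program; the DQN constructor call follows the tutorial, so treat it as an assumption if your version differs):

    from machin.frame.algorithms import DQN
    import torch as t
    import torch.nn as nn

    # moving the networks before handing them to the framework lets it
    # infer the input/output devices from the parameter location (cuda:0)
    q_net = QNet(observe_dim, action_num).to("cuda:0")
    q_net_t = QNet(observe_dim, action_num).to("cuda:0")
    dqn = DQN(q_net, q_net_t, t.optim.Adam, nn.MSELoss(reduction="sum"))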

Explanation

    # let the framework determine the input/output device based on
    # the parameter location; a warning will be thrown.
    q_net = QNet(observe_dim, action_num)
    q_net_t = QNet(observe_dim, action_num)

    # to mark the input/output device manually;
    # will not work if you move your model to other devices
    # after wrapping

    # q_net = static_module_wrapper(q_net, "cpu", "cpu")
    # q_net_t = static_module_wrapper(q_net_t, "cpu", "cpu")

    # to mark the input/output device automatically;
    # will not work if your model is located on multiple devices

    # q_net = dynamic_module_wrapper(q_net)
    # q_net_t = dynamic_module_wrapper(q_net_t)

  1. If your modules are not wrapped, the framework will try to determine your model's input & output location from its parameter location, but your model must satisfy: input device = output device = parameter device(s). You can see this process here; a simplified sketch of the check is given below.
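
    A minimal sketch of what such a check can look like (plain PyTorch for illustration only, not machin's actual code):

    import torch as t
    import torch.nn as nn

    def infer_device(module: nn.Module) -> t.device:
        # inference only works when every parameter sits on one device
        devices = {p.device for p in module.parameters()}
        if len(devices) != 1:
            raise RuntimeError("parameters span multiple devices; "
                               "wrap the model instead")
        return devices.pop()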

  2. dynamic_module_wrapper is a class inheriting nn.Module from PyTorch. After wrapping your module inside this wrapper, you will be able to move your module around; however, it uses the same mechanism to determine the input/output location as not wrapping your model at all, so you must also make sure: input device = output device = parameter device(s):

    q_net = dynamic_module_wrapper(q_net).to("cuda:0")

    It is meant to serve as a reminder that a wrapper is needed for your raw model, since you need to specify the input/output devices whenever possible; otherwise, it would be hard to do things like auto-partitioning your model. A short sketch of this behavior follows.
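
    For instance (a sketch, assuming the tutorial's single-device QNet), the wrapped module keeps working after moves because the device is re-read from the parameters:

    # moving the wrapped module is fine as long as it stays on one device
    q_net = dynamic_module_wrapper(QNet(observe_dim, action_num))
    q_net = q_net.to("cuda:0")  # inputs/outputs now resolve to cuda:0
    q_net = q_net.to("cpu")     # and now back to cpu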

  3. static_module_wrapper is aimed at more complex models, with input on one device and output on another. It is therefore a one-time-for-all wrapper; as said in the documentation, you should only use this wrapper if you are not going to move your model around using .to(<device>), .cuda(), .cpu(), etc.:

    def static_module_wrapper(wrapped_module: nn.Module,
                              input_device: Union[str, t.device],
                              output_device: Union[str, t.device]):
        """
        Wrapped module could locate on multiple devices, but must not be moved.
        Input device and output device are statically specified by user.
        """
        wrapped_module.input_device = input_device
        wrapped_module.output_device = output_device
        return wrapped_module
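
    As an illustration, here is a hedged sketch of the kind of model this wrapper targets: a hypothetical TwoDeviceNet split across two GPUs (layer sizes and device names are made up for the example). Because the parameters span devices, the input/output locations cannot be inferred and must be declared statically:

    import torch as t
    import torch.nn as nn

    class TwoDeviceNet(nn.Module):
        # hypothetical model: first layer on cuda:0, second on cuda:1
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(4, 16).to("cuda:0")
            self.fc2 = nn.Linear(16, 2).to("cuda:1")

        def forward(self, x):
            x = t.relu(self.fc1(x))          # computed on cuda:0
            return self.fc2(x.to("cuda:1"))  # result lives on cuda:1

    net = static_module_wrapper(TwoDeviceNet(), "cuda:0", "cuda:1")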

Feel free to reopen this if you have any questions.