taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Erroneous zero gradients #1966

Open samuela opened 3 years ago

samuela commented 3 years ago

Describe the bug

I've found that it's possible to get Taichi to return incorrect gradients with certain needs_grad settings.

To Reproduce

import numpy as np
import taichi as ti

np.random.seed(123)
real = ti.f32
ti.init(default_fp=real)

loss = ti.field(dtype=real, shape=(), needs_grad=True)
gravitation = ti.field(dtype=real, shape=(), needs_grad=True)
x = ti.Vector.field(2, dtype=real, shape=(), needs_grad=True)

# If needs_grad is False on `acc`, then we get erroneous zero gradients, as
# opposed to a missing needs_grad error. I never actually ask for `acc.grad`, so
# this must be some kind of internal AD issue.
acc = ti.Vector.field(2, dtype=real, shape=(), needs_grad=False)
# OTOH, whether or not `acc_bar` is marked as needs_grad has no impact on the
# gradients.
acc_bar = ti.Vector.field(2, dtype=real, shape=())

@ti.kernel
def forces():
    len_r = ti.max(x.norm(), 1e-1)
    acc[None] = gravitation[None] / (len_r * len_r * len_r) * x

@ti.kernel
def vjp():
    loss[None] = acc[None].dot(acc_bar[None])

x.from_numpy(np.random.randn(2).astype(np.float32))
gravitation.from_numpy(np.array(np.random.randn()).astype(np.float32))
acc_bar.from_numpy(np.random.randn(2).astype(np.float32))

with ti.Tape(loss):
    forces()
    vjp()

# These should be non-zero!
print(x.grad.to_numpy())
print(gravitation.grad.to_numpy())

Log/Screenshots

(venv) samuela@n64:~/dev/research/julia/odecontrol$ python difftaichi/electric/poop.py 
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-kl7gniy8
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9
[Taichi] Starting on arch=x64
[Taichi] materializing...
[0. 0.]
0.0

even though the correct gradients are non-zero.
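As an independent check that the expected gradients really are non-zero, here is a minimal NumPy-only sketch that re-implements the same loss and estimates its gradients by central finite differences. It does not use Taichi at all. The helper names `loss_fn` and `finite_diff_grads` are made up for illustration, and it works in float64 rather than the f32 used above, so the values will only approximately match what a correct Taichi run would print.

```python
import numpy as np

def loss_fn(x, g, acc_bar):
    # Mirrors forces() followed by vjp() from the reproducer.
    len_r = max(np.linalg.norm(x), 1e-1)
    acc = g / (len_r * len_r * len_r) * x
    return float(acc @ acc_bar)

def finite_diff_grads(x, g, acc_bar, eps=1e-6):
    # Central differences with respect to each component of x, and to g.
    grad_x = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad_x[i] = (loss_fn(x + step, g, acc_bar)
                     - loss_fn(x - step, g, acc_bar)) / (2 * eps)
    grad_g = (loss_fn(x, g + eps, acc_bar)
              - loss_fn(x, g - eps, acc_bar)) / (2 * eps)
    return grad_x, grad_g

# Same seed and draw order as the reproducer (assumption: float64 here, not f32).
np.random.seed(123)
x = np.random.randn(2)
g = float(np.random.randn())
acc_bar = np.random.randn(2)

grad_x, grad_g = finite_diff_grads(x, g, acc_bar)
print(grad_x)  # estimate of what x.grad should hold
print(grad_g)  # estimate of what gravitation.grad should hold
```

Both printed values should be clearly non-zero, which is what `x.grad` and `gravitation.grad` ought to reflect.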

Additional comments

(venv) samuela@n64:~/dev/research/julia/odecontrol$ ti diagnose
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-3yw6dnie
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://taichi.rtfd.io/en/stable
GitHub: https://github.com/taichi-dev/taichi
Forum:  https://forum.taichi.graphics

Taichi system diagnose:

python: 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0]
system: linux
executable: /home/samuela/dev/difftaichi/venv/bin/python
platform: Linux-5.3.0-51-generic-x86_64-with-Ubuntu-18.04-bionic
architecture: 64bit ELF
uname: Linux n64 5.3.0-51-generic #44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

locale: en_US.UTF-8
PATH: /home/samuela/dev/difftaichi/venv/bin:/home/samuela/.vscode-server-insiders/bin/89c002ab02f87102d91efc83c191ef1174756c6a/bin:/home/samuela/julia-1.5.1/bin:/home/samuela/.local/bin:/home/samuela/.vscode-server-insiders/bin/89c002ab02f87102d91efc83c191ef1174756c6a/bin:/home/samuela/julia-1.5.1/bin:/home/samuela/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

TAICHI_REPO_DIR=

import: <module 'taichi' from '/home/samuela/dev/difftaichi/venv/lib/python3.6/site-packages/taichi/__init__.py'>

cc: True
cpu: True
metal: False
opengl: False
cuda: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo': 'glewinfo'

Fri Oct 16 18:56:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K4000        Off  | 00000000:49:00.0 Off |                  N/A |
| 30%   29C    P8    10W /  87W |     26MiB /  3015MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2717      G   /usr/lib/xorg/Xorg                             8MiB |
|    0      2749      G   /usr/bin/gnome-shell                           3MiB |
+-----------------------------------------------------------------------------+

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-rg6yxj55
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-q59prmfs
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9
[Taichi] Starting on arch=x64

[W 10/16/20 18:56:35.118] [__init__.py:adaptive_arch_select@558] Arch=[Arch.opengl] is not supported, falling back to CPU
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-xkd0gc2q
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9
[Taichi] Starting on arch=x64

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-_p8g6nti
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9
[Taichi] Starting on arch=cuda

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-8qvrv_0e
[Taichi] version 0.6.40, llvm 10.0.0, commit 45f52a90, linux, python 3.6.9

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://taichi.rtfd.io/en/stable
GitHub: https://github.com/taichi-dev/taichi
Forum:  https://forum.taichi.graphics

Running example minimal ...
[Taichi] Starting on arch=x64
[Taichi] materializing...
>>> Running time: 0.16s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 2.98s
(venv) samuela@n64:~/dev/research/julia/odecontrol$ 

k-ye commented 3 years ago

Thanks! I believe this happens because Taichi's autodiff pass silently swallows the missing gradient. See https://github.com/taichi-dev/taichi/blob/4a568522572fbaaa8ef9c7e594553361e329aaa0/taichi/transforms/auto_diff.cpp#L654

We could check for failure there and report it, though I'm not sure whether there is some other reason not to.

k-ye commented 3 years ago

Adding an even simpler reproducer:

import numpy as np
import taichi as ti

np.random.seed(123)
real = ti.f32
ti.init(ti.cpu, default_fp=real, print_ir=True)

loss = ti.field(dtype=real, shape=(), needs_grad=True)
x = ti.field(dtype=real, shape=(), needs_grad=True)
acc = ti.field(dtype=real, shape=(), needs_grad=False)

@ti.kernel
def forces():
    acc[None] = x[None]

@ti.kernel
def vjp():
    loss[None] = acc[None]

x[None] = np.random.randn(1).astype(np.float32)[0]

with ti.Tape(loss):
    forces()
    vjp()

# This should be non-zero!
print(x.grad.to_numpy())

Because acc's grad is never instantiated, the kernel vjp_c7_0_grad_grad is eventually optimized into an empty kernel:

[I 10/24/20 12:55:09.690] [compile_to_offloads.cpp:operator()@18] [vjp_c7_0_grad_grad] Initial IR:
kernel {
  #@tmp0[] = gbl load #@tmp4[]
}
...
[I 10/24/20 12:55:09.691] [compile_to_offloads.cpp:operator()@18] [vjp_c7_0_grad_grad] Simplified IV:
kernel {
}
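
To make the failure mode concrete, here is a toy pure-Python model of the reverse pass for the two copy kernels above. It is only a sketch and not Taichi's implementation: `Field`, `backprop_copy`, `run`, and the `strict` flag are hypothetical names. It shows how a missing adjoint slot on `acc` breaks the chain from `loss` back to `x`, and how a check at that point could turn the silent zero into an error.

```python
class Field:
    """Toy scalar field; the adjoint slot exists only if needs_grad is set."""
    def __init__(self, name, needs_grad):
        self.name = name
        self.value = 0.0
        self.grad = 0.0 if needs_grad else None

def backprop_copy(dst, src, strict=False):
    """Reverse of `dst = src`: accumulate dst's adjoint into src's adjoint."""
    if dst.grad is None or src.grad is None:
        if strict:
            missing = dst.name if dst.grad is None else src.name
            raise RuntimeError(f"field '{missing}' has no gradient")
        return  # silently dropped, like the empty gradient kernel
    src.grad += dst.grad

def run(acc_needs_grad, strict=False):
    x = Field("x", needs_grad=True)
    acc = Field("acc", needs_grad=acc_needs_grad)
    loss = Field("loss", needs_grad=True)

    # Forward pass: forces() then vjp().
    x.value = 1.5
    acc.value = x.value
    loss.value = acc.value

    # Reverse pass: seed the loss adjoint, then walk the kernels backwards.
    loss.grad = 1.0
    backprop_copy(loss, acc, strict)  # reverse of vjp()
    backprop_copy(acc, x, strict)     # reverse of forces()
    return x.grad

print(run(acc_needs_grad=True))   # 1.0
print(run(acc_needs_grad=False))  # 0.0
```

Calling `run(acc_needs_grad=False, strict=True)` raises a `RuntimeError` instead of returning a zero gradient, which is the kind of behavior the check suggested above would give.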