taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Possibly incorrect gradients from autodiff #8356

Open chenzhekl opened 11 months ago

chenzhekl commented 11 months ago

Describe the bug

The same algorithm, written in two equivalent forms, produces different gradients.

To Reproduce

import taichi as ti
import torch

ti.init(arch=ti.cuda)

@ti.kernel
def foo(x: ti.types.ndarray(), y: ti.types.ndarray()):
    for i in x:
        # a = 0.0
        # for j in y:
        #     a += y[j]
        # x[i] += a
        for j in y:
            x[i] += y[j]

x = torch.tensor(
    [0, 0, 0, 0, 0], dtype=torch.float32, device="cuda", requires_grad=True
)
y = torch.tensor([1, 2, 3], dtype=torch.float32, device="cuda", requires_grad=True)

foo(x, y)
x.grad = torch.ones_like(x)
foo.grad(x, y)
print(x.grad, y.grad)

The above code outputs

[Taichi] version 1.7.0, llvm 15.0.4, commit aa0619fb, linux, python 3.10.12
[Taichi] Starting on arch=cuda
tensor([1., 1., 1., 1., 1.], device='cuda:0') tensor([1., 1., 1.], device='cuda:0')

while the commented-out code, which does the same thing, outputs:

[Taichi] version 1.7.0, llvm 15.0.4, commit aa0619fb, linux, python 3.10.12
[Taichi] Starting on arch=cuda
tensor([1., 1., 1., 1., 1.], device='cuda:0') tensor([5., 5., 5.], device='cuda:0')

jim19930609 commented 11 months ago

OK, the problem is that the Taichi compiler does not reject the use of a "struct-for" as an inner loop, which is what the "for j in y" line in the code is.

Use a range-for for the inner loop instead, something like:

@ti.kernel
def foo(x: ti.types.ndarray(), y: ti.types.ndarray()):
    for i in x:
        for j in range(y.shape[0]):
            x[i] += y[j]

Meanwhile, we should add a guard that emits an error message when a struct-for is used inside another loop.
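For reference, the expected gradients can be checked by hand without Taichi. The sketch below is plain Python and assumes the same setup as the reproduction (five entries in x, three in y, upstream gradient of all ones); the helper name expected_grads is hypothetical and only illustrates the arithmetic. It suggests that tensor([5., 5., 5.]) is the correct y.grad, so the range-for version should presumably match the output of the commented-out variant.

```python
# Plain-Python sketch (no Taichi): reverse-mode gradients of
#     x[i] += sum_j y[j]    for every i
# given an upstream gradient dL/dx. The function name is hypothetical.

def expected_grads(y, x_grad):
    # x[i] enters its own update with coefficient 1,
    # so the upstream gradient passes through unchanged.
    grad_x = list(x_grad)
    # Each y[j] is added once to every x[i],
    # so dL/dy[j] = sum_i dL/dx[i] for every j.
    total = sum(x_grad)
    grad_y = [total for _ in y]
    return grad_x, grad_y


grad_x, grad_y = expected_grads([1.0, 2.0, 3.0], [1.0] * 5)
print(grad_x)  # [1.0, 1.0, 1.0, 1.0, 1.0]
print(grad_y)  # [5.0, 5.0, 5.0]
```

Each y[j] contributes to all five entries of x, which is where the factor of 5 comes from; the struct-for version's y.grad of all ones drops that accumulation over i.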