swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0
67.69k stars 10.39k forks source link

[AutoDiff] Replicated tangent vector within generated pullback for mutating function with control flow #61773

Open BradLarson opened 2 years ago

BradLarson commented 2 years ago

Describe the bug In differentiable mutating functions (although this may also extend to ones with inouts), adding control flow appears to lead to replicated tangent vectors in the generated pullback. This can be seen in the generated SIL, where replicated alloc_stacks for the same tangent vector appear in the first basic block of the pullback, one after the other, and then series of redundant store, copy_addr, and the like show up in the subsequent basic blocks.

An example that illustrates this is the following single-file reproducer:

import _Differentiation

struct Test: Differentiable {
    var val1: Float
    var val2: Float

    @differentiable(reverse)
    mutating func doSomething(input: Float) {
        self.val1 *= input
        self.val2 *= input

        if self.val1 > input {
            self.val1 = input
        }
        if self.val2 > input {
            self.val2 = input
        }
    }
}

@differentiable(reverse)
func wrapper(input: Float, multiplier: Float) -> Float {
    var test = Test(val1: input, val2: input)
    test.doSomething(input: multiplier)
    return test.val1 * test.val2
}

let grad = gradient(at: 2.0, 3.0, of: wrapper)
print("Grad: \(grad)")

where if you build and -emit-sil, then look at the pullback of Test.doSomething(input:), you'll see something like the following in bb0:

  %2 = alloc_stack $Test.TangentVector, var, name "self", argno 2, implicit // users: %170, %169, %168, %167, %162, %159, %156, %153, %6, %206, %195, %187, %180
  %3 = metatype $@thin Test.TangentVector.Type    // user: %5
  // function_ref static Test.TangentVector.zero.getter
  %4 = function_ref @$s10TestModule0A0V13TangentVectorV4zeroAEvgZ : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %5
  %5 = apply %4(%3) : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %6
  store %5 to %2 : $*Test.TangentVector           // id: %6
  %7 = tuple ()
  %8 = alloc_stack $Test.TangentVector, var, name "self", argno 2, implicit // users: %136, %12, %205, %204, %134, %133, %132
  %9 = metatype $@thin Test.TangentVector.Type    // user: %11
  // function_ref static Test.TangentVector.zero.getter
  %10 = function_ref @$s10TestModule0A0V13TangentVectorV4zeroAEvgZ : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %11
  %11 = apply %10(%9) : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %12
  store %11 to %8 : $*Test.TangentVector          // id: %12
  %13 = tuple ()
  %14 = alloc_stack $Test.TangentVector, var, name "self", argno 2, implicit // users: %130, %18, %203, %202, %140, %118, %117, %116, %115
  %15 = metatype $@thin Test.TangentVector.Type   // user: %17
  // function_ref static Test.TangentVector.zero.getter
  %16 = function_ref @$s10TestModule0A0V13TangentVectorV4zeroAEvgZ : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %17
  %17 = apply %16(%15) : $@convention(method) (@thin Test.TangentVector.Type) -> Test.TangentVector // user: %18
  store %17 to %14 : $*Test.TangentVector         // id: %18

entries like the following in bb2:

  store %67 to %20 : $*Test.TangentVector         // id: %101
  store %67 to %20 : $*Test.TangentVector         // id: %102
  store %67 to %20 : $*Test.TangentVector         // id: %103
  store %67 to %20 : $*Test.TangentVector         // id: %104
  store %67 to %20 : $*Test.TangentVector         // id: %105

and so on. The number of replicated tangent vectors and related instructions scales with the number of instances of control flow in the original function. With large tangent vectors and many if() statements, this can generate code that causes LLVM to exhaust all available memory when further handling the resulting SIL.

Resulting SIL has been attached to this issue for the above Swift example.

This is not a recent regression, and has been present for at least the last year.

Steps To Reproduce Steps to reproduce the behavior:

  1. Place the above single-file reproducer in test.swift.
  2. Emit SIL via swiftc -Onone -emit-sil test.swift -o test.sil
  3. Examine test.sil.

Expected behavior There should be a single instance of the tangent vector for self, no redundant operations on it, and adding more control flow to the original function should not cause a steady increase in tangent vector copies and replicated instructions.

Environment (please fill out the following information)

Additional resources

An example of the generated SIL is attached as a file. test-sil.txt

asl commented 1 year ago

Ok, one of the problem are projections. Essentially, read / write access of struct fields yields with a begin_access / end_access bracket. Adjoint buffer of projections are projections of adjoint buffer. However, we do not do any CSE. As a result, we're ending with propagating the same value (that is being projected several times) over and over again.

tbkka commented 1 year ago

I see you specifically included -Onone, which disables most SIL optimizations. Do you see the same with -O or without the -Onone?

asl commented 1 year ago

@tbkka If I correctly remember @BradLarson's case, the problem was the excessive memory consumption and compilation time. So, either swift-level optimizations or LLVM-level were able to remove extraneous copies, however, only in the case where compilation succeeded. And in some cases it did not due to OOM.