swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0

[SR-8589] [blocker] `model.updateParameters` in the iris tutorial crashes the compiler #51107

Closed 83b09dda-53f5-4663-b44c-87fd2a9a8717 closed 6 years ago

83b09dda-53f5-4663-b44c-87fd2a9a8717 commented 6 years ago
Previous ID SR-8589
Radar None
Original Reporter @marcrasi
Type Bug
Status Resolved
Resolution Done
Additional Detail from JIRA

| Field | Value |
|-------------|-----------------|
| Votes | 0 |
| Component/s | Swift for TensorFlow |
| Labels | Bug, s4tf-iris-tutorial |
| Assignee | bgogul (JIRA) |
| Priority | Medium |

md5: cffdda1ec2c8949bd9538c7f7ce2d691

Issue Description:

Execute the following two cells in Jupyter (or paste them into an LLDB REPL):

import TensorFlow
struct Model : Parameterized {
    @TFParameter var x: Tensor<Float> = Tensor(zeros: [10, 10])
}
var model = Model()
model.updateParameters(withGradients: Model.Parameters(x: Tensor(1))) { $0 -= 0.1 * $1 }

You should get this error:

!!! Compiler bug -- Tensor op builtin __tfop_tfc.scalarToTensor,$in cannot be lowered to LLVM IR !!!

If you do everything in one cell, or everything in one Swift script, you won't get the error!
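For clarity, the presumed cell split is as follows (the boundary is marked with a comment; this assumes the declarations go in the first cell and the update call in the second, and it requires the Swift for TensorFlow toolchain of the time, so it is not runnable on a current toolchain):

```swift
// Cell 1: declarations (compiles and runs fine on its own)
import TensorFlow
struct Model : Parameterized {
    @TFParameter var x: Tensor<Float> = Tensor(zeros: [10, 10])
}
var model = Model()

// Cell 2: the call that triggers the "cannot be lowered to LLVM IR" crash
model.updateParameters(withGradients: Model.Parameters(x: Tensor(1))) { $0 -= 0.1 * $1 }
```

Merging both cells into one avoids the crash, which is the first hint that the failure involves state lost between REPL evaluations rather than the code itself.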

rxwei commented 6 years ago

Do you have a minimal reproducer?

83b09dda-53f5-4663-b44c-87fd2a9a8717 commented 6 years ago

I created one and updated the description. It seems to be some interaction with LLDB, because I can't get it to cause an error in a plain Swift script.

rxwei commented 6 years ago

The problem does not occur in compiler mode or script interpreter mode; it only occurs in the REPL.

rxwei commented 6 years ago

Same with

model.allParameters.update(withGradients: Model.Parameters(x: Tensor(1))) { $0 -= 0.1 * $1 }

rxwei commented 6 years ago

This bug clearly has the highest priority, because the parameter aggregate is a strict requirement. bgogul (JIRA User), would you like to take a look? I can provide further support on this.

The first step is to figure out why the trailing closure

{ $0 -= 0.1 * $1 }

... is not getting partitioned.

One quick, simple hack is to make the compiler partition even inlinable functions, as long as they are not generic. This change can be added at TFUtilities.cpp:662.

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

It looks like we have a workaround for the time being. I will take a look once I wrap up the other bugs that are blocking the tutorial.

rxwei commented 6 years ago

There's no workaround yet.

If you do everything in one cell, or if you do everything in one swift script, then you won't get the error!

Putting everything in one cell would make the tutorial notebook no longer a notebook.

ed4ae5fb-5ade-43c3-bf64-26e47fc0876d commented 6 years ago

Does this work if you hard code "-tf-promote-global-variables" to enabled?

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

@lattner, enabling tf-promote-global-variables did not fix the issue. I will dig around a bit more.

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

Just an update based on my investigation. In REPL mode, the compiler fails to inline the call to updateParameters:

model.updateParameters(withGradients: Model.Parameters(x: Tensor(1))) { $0 -= 0.1 * $1 }

The reason is that the attempt to link updateParameters fails. The relevant condition in shouldInlinePredicate is the following:

if (callee.empty() && !site.getModule().linkFunction(&callee, SILModule::LinkingMode::LinkAll)  

"site.getModule().linkFunction(...)" returns false for updateParameters in REPL mode.

rxwei commented 6 years ago

How about the following?

model.allParameters.update(withGradients: Model.Parameters(x: Tensor(1))) { $0 -= 0.1 * $1 }

Is this causing the same error?

This case above should be tested first. `updateParameters` is just a Swift-level wrapper that might obscure what's going on.
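To make the wrapper relationship concrete, here is a minimal, self-contained model of it in plain Swift. `ParameterAggregate`, `Params`, and the method signatures below are illustrative stand-ins, not the actual Swift for TensorFlow declarations, and `Float` stands in for `Tensor<Float>`; only the forwarding shape matters.

```swift
// A sketch of the relationship described above: `updateParameters` is a
// thin wrapper that forwards to `allParameters.update(withGradients:_:)`.
protocol ParameterAggregate {
    mutating func update(withGradients gradients: Self,
                         _ updater: (inout Float, Float) -> Void)
}

struct Params: ParameterAggregate {
    var x: Float
    // Apply the updater to each parameter/gradient pair (here, just x).
    mutating func update(withGradients gradients: Params,
                         _ updater: (inout Float, Float) -> Void) {
        updater(&x, gradients.x)
    }
}

struct Model {
    var allParameters = Params(x: 1.0)
    // The Swift-level wrapper: forwarding only, so both spellings in this
    // thread exercise the same code path and hit the same REPL failure.
    mutating func updateParameters(withGradients gradients: Params,
                                   _ updater: (inout Float, Float) -> Void) {
        allParameters.update(withGradients: gradients, updater)
    }
}

var model = Model()
model.updateParameters(withGradients: Params(x: 1.0)) { $0 -= 0.1 * $1 }
print(model.allParameters.x)  // x is now 1.0 - 0.1 * 1.0 = 0.9
```

Because the wrapper only forwards, testing `allParameters.update(...)` directly isolates the partitioning problem from the extra call layer.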

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

Yes, changing it to allParameters.update(..) is causing the same error.

I think I understand the issue now. Basically, the SIL IR is not retained after evaluating a line in REPL mode; only the lowered LLVM IR persists. That is why the deabstraction phase is not able to link the SILFunction in.

rxwei commented 6 years ago

I don't think that's the case; here's an example:

  2> func foo<T>(x: Tensor<T>) -> Tensor<T> {
  3.     return x
  4. } // I hit enter here
  5> foo(x: Tensor(1))
2018-08-23 16:33:19.642759: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
$R0: TensorFlow.Tensor<Double> = 1.0

This means the SIL is being preserved. The compiler cannot partition generic functions.

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

I did a bit more digging around. I don't completely understand how the REPL is implemented. However, I can see that the SILModule is thrown away after generating the LLVM IR. For instance, SwiftExpressionParser.cpp:1887 calls swift::PerformIRGeneration(), which ultimately releases the SILModule at IRGen.cpp:854.

If this is the case, we need to have a way of preserving the relevant SIL code throughout the REPL session.

swift-ci commented 6 years ago

Comment by Gogul Balakrishnan (JIRA)

I sent out a hacky patch (#19502) to deal with this issue for the time being.

@rxwei, could it be the case that the generic case is working because foo has no #tfops, and hence no deabstraction/partitioning is needed? Here is the SIL that was generated for foo in LLDB:

// Foo<A>(x:)
sil @$S13__lldb_expr_33Foo1x10TensorFlow0D0VyxGAG_tAD013AccelerableBydE0RzlF : $@convention(thin) <T where T : AccelerableByTensorFlow> (@guaranteed Tensor<T>) -> @owned Tensor<T> {
// %0                                             // users: %2, %1
bb0(%0 : @guaranteed $Tensor<T>):
  debug_value %0 : $Tensor<T>, let, name "x", argno 1 // id: %1
  %2 = copy_value %0 : $Tensor<T>                 // user: %3
  return %2 : $Tensor<T>                          // id: %3
} // end sil function '$S13__lldb_expr_33Foo1x10TensorFlow0D0VyxGAG_tAD013AccelerableBydE0RzlF'

rxwei commented 6 years ago

The function not having tensor ops is a good point. I intended to add some tensor code there but forgot...

The following case is already being diagnosed, which is great.

  4> func bar<T : Numeric>(x: Tensor<T>) -> Tensor<T> { return x + x }
error: repl.swift:4:6: error: internal error: TensorFlow graph program extraction does not work on generic functions yet
func bar<T : Numeric>(x: Tensor<T>) -> Tensor<T> { return x + x }
     ^

And the following case fails as expected:

  4> @inlinable func bar<T : Numeric>(x: Tensor<T>) -> Tensor<T> { return x + x }
  5> bar(x: Tensor(1))

Your patch makes sense and is a nice hack to unblock progress!