keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Panic when loading a keras LSTM model in Go #19085

Closed: rcostu closed this issue 7 months ago

rcostu commented 8 months ago

Hi,

I was redirected here from the tfgo repository, as the problem seems to be in the TensorFlow library; since it is related to Keras, they then sent me here from issue #63824 in the TF project. The problem documented here is exactly the same whether I use tfgo's LoadModel function or TF's LoadSavedModel function.

I have developed a model that makes a time series forecast: it receives the last 60 days of data as floats and returns one float. I am training the model in Python and saving it with the export function, as described in the TF documentation for Keras models.

The model definition in Python looks like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense

self.model = Sequential()
self.model.add(LSTM(50, return_sequences=True, input_shape=(x_train.shape[1], 1)))
self.model.add(LSTM(50, return_sequences=False))
self.model.add(Dense(25))
self.model.add(Dense(25))
self.model.add(Dense(1))

# compile and train the model
self.model.compile(optimizer='adam', loss='mean_squared_error')
self.model.fit(x_train, y_train, batch_size=1, epochs=1, verbose=self.config.VERBOSE)

# export as a SavedModel for serving
self.model.export("/Users/rcostumero/Downloads/test_model_go")

The saved_model_cli show command gives this output:

saved_model_cli show --all --dir /Users/rcostumero//Downloads/test_model_go
2024-01-09 21:09:48.934198: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serve']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['lstm_2_input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 60, 1)
        name: serve_lstm_2_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output_0'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['lstm_2_input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 60, 1)
        name: serving_default_lstm_2_input:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output_0'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall_1:0
  Method name is: tensorflow/serving/predict
The MetaGraph with tag set ['serve'] contains the following ops: {'DisableCopyOnRead', 'AssignVariableOp', 'Tanh', 'ShardedFilename', 'StridedSlice', 'MergeV2Checkpoints', 'Split', 'Mul', 'VarHandleOp', 'Placeholder', 'MatMul', 'BiasAdd', 'StatelessWhile', 'Const', 'Pack', 'Fill', 'Transpose', 'TensorListFromTensor', 'ReadVariableOp', 'Identity', 'TensorListStack', 'StaticRegexFullMatch', 'Select', 'RestoreV2', 'SaveV2', 'StringJoin', 'TensorListReserve', 'StatefulPartitionedCall', 'AddV2', 'Sigmoid', 'PartitionedCall', 'Shape', 'NoOp'}

Concrete Functions:
  Function Name: 'serve'
    Option #1
      Callable with:
        Argument #1
          lstm_2_input: TensorSpec(shape=(None, 60, 1), dtype=tf.float32, name='lstm_2_input')

And saved_model_cli run with the following command works as expected:

saved_model_cli run --dir /Users/rcostumero//Downloads/test_model_go --tag_set serve --signature_def serve --input_exprs='lstm_2_input=[[[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1],[1]]]'
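
For reference, roughly the same check could be done from Python as well; a minimal sketch along these lines (assuming the same export path as above), which just loads the SavedModel and calls the 'serve' signature with a dummy (1, 60, 1) input:

import numpy as np
import tensorflow as tf

# Load the exported SavedModel and look up the "serve" signature
loaded = tf.saved_model.load("/Users/rcostumero/Downloads/test_model_go")
serve_fn = loaded.signatures["serve"]

# Dummy input matching the (None, 60, 1) DT_FLOAT spec reported by saved_model_cli
dummy = tf.constant(np.ones((1, 60, 1), dtype=np.float32))
print(serve_fn(lstm_2_input=dummy))  # expect a dict with a (1, 1) float32 'output_0'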

However, when running the Go code as in the given example, it panics while loading the model. The code looks like this:

package main

import (
    "fmt"

    tf "github.com/galeone/tensorflow/tensorflow/go"
    tg "github.com/galeone/tfgo"
)

func main() {
    fmt.Println("starting")

    model := tg.LoadModel("/Users/rcostumero/Downloads/test_model_go", []string{"serve"}, nil)
    fmt.Println("model", model)

    fakeInput, _ := tf.NewTensor([1][60][1]float32{})
    fmt.Println("input:", fakeInput)

    results := model.Exec([]tf.Output{
        model.Op("StatefulPartitionedCall", 0),
    }, map[tf.Output]*tf.Tensor{
        model.Op("lstm_2_input", 0): fakeInput,
    })

    fmt.Println("results", results)

    predictions := results[0]
    fmt.Println(predictions.Value())
}

And the panic shows this error:

starting
2024-01-09 21:04:40.539502: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /Users/rcostumero/Downloads/test_model_go
2024-01-09 21:04:40.544963: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-01-09 21:04:40.545008: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /Users/rcostumero/Downloads/test_model_go
2024-01-09 21:04:40.545080: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-09 21:04:40.572116: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-01-09 21:04:40.578214: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
SIGSEGV: segmentation violation
PC=0x1273c5e5d m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x1001e1580, 0xc0000ed998)
        /usr/local/opt/go/libexec/src/runtime/cgocall.go:157 +0x4b fp=0xc0000ed970 sp=0xc0000ed938 pc=0x100007fab
github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel(0x7f7c88081d70, 0x0, 0x7f7c88081090, 0xc0000ae030, 0x1, 0x7f7c8781c600, 0x7f7c8777ae20, 0x7f7c8807ec40)
        _cgo_gotypes.go:1000 +0x4c fp=0xc0000ed998 sp=0xc0000ed970 pc=0x1001cb7ac
github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2(0x7f7c88081d70, 0x5?, 0xc0000edb38, 0x1000a5de0?, 0x1?, 0xc0000c0000?)
       /Users/rcostumero/src/go/pkg/mod/github.com/galeone/tensorflow/tensorflow/go@v0.0.0-20221023090153-6b7fa0680c3e/saved_model.go:72 +0x14d fp=0xc0000eda20 sp=0xc0000ed998 pc=0x1001d85ad
github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel({0x1002a58f8, 0x29}, {0xc0000eddc0, 0x1, 0x9?}, 0xc0000b2680?)
        /Users/rcostumero/src/go/pkg/mod/github.com/galeone/tensorflow/tensorflow/go@v0.0.0-20221023090153-6b7fa0680c3e/saved_model.go:72 +0x2b7 fp=0xc0000edbd0 sp=0xc0000eda20 pc=0x1001d7e97
github.com/galeone/tfgo.LoadModel({0x1002a58f8, 0x29}, {0xc0000eddc0, 0x1, 0x1}, 0x0?)
        /Users/rcostumero/src/go/pkg/mod/github.com/galeone/tfgo@v0.0.0-20230715013254-16113111dc99/model.go:36 +0x65 fp=0xc0000edc20 sp=0xc0000edbd0 pc=0x1001dfe05
main.main()
        /Users/rcostumero/Developer/go/pkg/darwin_amd64/quadro/app.go:39 +0xaf fp=0xc0000edf40 sp=0xc0000edc20 pc=0x1001e026f
runtime.main()
        /usr/local/opt/go/libexec/src/runtime/proc.go:267 +0x2bb fp=0xc0000edfe0 sp=0xc0000edf40 pc=0x100038e9b
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000edfe8 sp=0xc0000edfe0 pc=0x100065101

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000050fa8 sp=0xc000050f88 pc=0x1000392ee
runtime.goparkunlock(...)
        /usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.forcegchelper()
        /usr/local/opt/go/libexec/src/runtime/proc.go:322 +0xb3 fp=0xc000050fe0 sp=0xc000050fa8 pc=0x100039173
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000050fe8 sp=0xc000050fe0 pc=0x100065101
created by runtime.init.6 in goroutine 1
        /usr/local/opt/go/libexec/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000051778 sp=0xc000051758 pc=0x1000392ee
runtime.goparkunlock(...)
        /usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
        /usr/local/opt/go/libexec/src/runtime/mgcsweep.go:280 +0x94 fp=0xc0000517c8 sp=0xc000051778 pc=0x1000261b4
runtime.gcenable.func1()
        /usr/local/opt/go/libexec/src/runtime/mgc.go:200 +0x25 fp=0xc0000517e0 sp=0xc0000517c8 pc=0x10001b345
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000517e8 sp=0xc0000517e0 pc=0x100065101
created by runtime.gcenable in goroutine 1
        /usr/local/opt/go/libexec/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc00002c230?, 0x1002e0538?, 0x1?, 0x0?, 0xc0000071e0?)
        /usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000051f70 sp=0xc000051f50 pc=0x1000392ee
runtime.goparkunlock(...)
        /usr/local/opt/go/libexec/src/runtime/proc.go:404
runtime.(*scavengerState).park(0x100516c00)
        /usr/local/opt/go/libexec/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc000051fa0 sp=0xc000051f70 pc=0x100023a69
runtime.bgscavenge(0x0?)
        /usr/local/opt/go/libexec/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc000051fc8 sp=0xc000051fa0 pc=0x100023ffc
runtime.gcenable.func2()
        /usr/local/opt/go/libexec/src/runtime/mgc.go:201 +0x25 fp=0xc000051fe0 sp=0xc000051fc8 pc=0x10001b2e5
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000051fe8 sp=0xc000051fe0 pc=0x100065101
created by runtime.gcenable in goroutine 1
        /usr/local/opt/go/libexec/src/runtime/mgc.go:201 +0xa5

goroutine 18 [finalizer wait]:
runtime.gopark(0x198?, 0x10029b2c0?, 0x1?, 0xa4?, 0x0?)
        /usr/local/opt/go/libexec/src/runtime/proc.go:398 +0xce fp=0xc000050620 sp=0xc000050600 pc=0x1000392ee
runtime.runfinq()
        /usr/local/opt/go/libexec/src/runtime/mfinal.go:193 +0x107 fp=0xc0000507e0 sp=0xc000050620 pc=0x10001a367
runtime.goexit()
        /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000507e8 sp=0xc0000507e0 pc=0x100065101
created by runtime.createfing in goroutine 1
        /usr/local/opt/go/libexec/src/runtime/mfinal.go:163 +0x3d

rax    0x7f7c87839a68
rbx    0x12c412970
rcx    0x7f7c897e5598
rdx    0x55
rdi    0x12c8baa00
rsi    0x7f7c897ffce0
rbp    0x7ff7bfefca10
rsp    0x7ff7bfefca10
r8     0x0
r9     0x3
r10    0x1
r11    0xfffffffffffffdb8
r12    0x7ff7bfefcb20
r13    0x7f7c897e5598
r14    0x7f7c897ecb40
r15    0x7ff7bfefcb00
rip    0x1273c5e5d
rflags 0x10202
cs     0x2b
fs     0x0
gs     0x0
exit status 2

I have tried everything I could come up with and looked for similar cases, but I didn't find any solution.

Any help is more than welcome.

Thanks in advance!

qlzh727 commented 8 months ago

Triage note: Is this specific to LSTM? Does it reproduce without any RNN-related layer? Also adding @k-w-w from the saved model side for this question.

rcostu commented 8 months ago

Yes, I have replicated the test and got the same output after removing the LSTM, with no RNN layers at all.
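
For illustration, this is roughly the kind of RNN-free variant I mean (a minimal sketch, assuming plain Dense layers over the same (60, 1) input and a separate export path):

from keras.models import Sequential
from keras.layers import Dense, Flatten

# Same (60, 1) float input, but no LSTM/RNN layers at all
model = Sequential()
model.add(Flatten(input_shape=(60, 1)))
model.add(Dense(25))
model.add(Dense(25))
model.add(Dense(1))

model.compile(optimizer='adam', loss='mean_squared_error')
# model.fit(...) with the same training data as in the original model
model.export("/Users/rcostumero/Downloads/test_model_go_dense")  # illustrative path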

qlzh727 commented 8 months ago

I see. So this is probably a generic issue with saving/loading on the TF stack. @k-w-w is probably the best person for this issue; I am also not sure how compatible the TF Go backend is.

k-w-w commented 8 months ago

It looks like the Go wrapper is calling the C API (TF_LoadSessionFromSavedModel) to load the SavedModel, but when I try calling the method directly I don't see the same issue.

Code:

  std::string saved_model_dir = "/tmp/test_model_go";
  TF_CHECK_OK(tf::Env::Default()->FileExists(saved_model_dir));

  tensorflow::SessionOptions session_options;
  (*session_options.config.mutable_device_count())["GPU"] = 1;
  session_options.config.mutable_gpu_options()
      ->set_per_process_gpu_memory_fraction(0.5);

  TF_SessionOptions* tf_session_options = TF_NewSessionOptions();
  TF_Status* tf_status = TF_NewStatus();

  TF_Graph* tf_graph = TF_NewGraph();
  const char* tags[] = {tf::kSavedModelTagServe};
  TF_Session* sess = TF_LoadSessionFromSavedModel(
      tf_session_options, /*run_options=*/nullptr, saved_model_dir.c_str(),
      tags, /*tags_len=*/1, tf_graph, /*metagraph_buffer=*/nullptr, tf_status);

  TF_DeleteSession(sess, tf_status);

Logs:

I0130 20:59:26.284406  188638 reader.cc:83] Reading SavedModel from: /tmp/test_model_go
I0130 20:59:26.284449  188638 merge.cc:149] Reading binary proto from /tmp/test_model_go/saved_model.pb
I0130 20:59:26.285070  188638 merge.cc:152] Finished reading binary proto, took 624 microseconds.
I0130 20:59:26.285089  188638 reader.cc:51] Reading meta graph with tags { serve }
I0130 20:59:26.285094  188638 reader.cc:153] Reading SavedModel debug info (if present) from: /tmp/test_model_go
I0130 20:59:26.285128  188638 cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0130 20:59:26.308795  188638 mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
I0130 20:59:26.309449  188638 loader.cc:234] Restoring SavedModel bundle.
I0130 20:59:26.338187  188638 loader.cc:218] Running initialization op on SavedModel bundle at path: /tmp/test_model_go
I0130 20:59:26.348788  188638 loader.cc:317] SavedModel load for tags { serve }; Status: success: OK. Took 64387 microseconds.

The only things I can think of are that there is an issue with the wrapper, or that there is a versioning difference. I'm exporting the model at head and loading the model at head, while the tfgo library is using a TensorFlow fork at 2.9. @rcostu Could you try exporting with the same TensorFlow version as the fork?
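
Something along these lines might work (a rough sketch, assuming Model.export() is not available in a 2.9-era TensorFlow, so tf.saved_model.save is used instead, with model being the trained Keras model):

import tensorflow as tf

print(tf.__version__)  # should report 2.9.x to match the version forked by tfgo

# Model.export() may not exist in TF 2.9; tf.saved_model.save writes the same kind of
# SavedModel directory that TF_LoadSessionFromSavedModel expects.
tf.saved_model.save(model, "/Users/rcostumero/Downloads/test_model_go")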

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 7 months ago

Are you satisfied with the resolution of your issue?