tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

TFLite crash (SIGABRT) while running Conv3D on Android #34286

Closed Mr-Grieves closed 3 years ago

Mr-Grieves commented 4 years ago

System information

Describe the current behavior I am trying to run a network (with Conv3D ops) on my Android device using TFLite. I have followed all the steps mentioned here, and I can convert the network without issue. At runtime I can also load the converted tflite model without issue; however, during my call to runForMultipleInputsOutputs(), TFLite crashes with a SIGABRT originating in libtensorflowlite_flex_jni.so (full stack trace below).

Describe the expected behavior I expect inference to run without crashing.

Code to reproduce the issue I made a dummy network to try and isolate the issue. I built the network using the following:

import tensorflow as tf
import numpy as np

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv3D(1, (4, 4, 4), input_shape=(4, 8, 8, 1), name='conv'))
model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
              optimizer=tf.keras.optimizers.RMSprop(lr=0.0001),
              metrics=[tf.keras.metrics.categorical_accuracy])

x = np.random.random((1, 4, 8, 8, 1))
y = np.random.random((1, 1, 5, 5, 1))
model.train_on_batch(x, y)
model.predict(x)

# Save tf.keras model in HDF5 format
keras_file = "conv3d.h5"
tf.keras.models.save_model(model, keras_file)

# Convert the model to tflite format
converter = tf.lite.TFLiteConverter.from_keras_model_file(keras_file)
converter.target_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                        tf.lite.OpsSet.SELECT_TF_OPS]
tflite_model = converter.convert()
open("conv3d.tflite", "wb").write(tflite_model)
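As a sanity check on the dummy network's shapes: with "valid" padding and stride 1, each Conv3D output spatial dimension is in - kernel + 1, which is why the target y above has spatial shape (1, 5, 5). A minimal pure-Python sketch (the helper name is illustrative, not part of any API):

```python
def conv3d_valid_output_shape(input_shape, kernel_shape, stride=1):
    """Spatial output dims for 'valid' padding: out = (in - k) // stride + 1."""
    return tuple((i - k) // stride + 1 for i, k in zip(input_shape, kernel_shape))

# Conv3D(1, (4, 4, 4)) on spatial input (4, 8, 8):
print(conv3d_valid_output_shape((4, 8, 8), (4, 4, 4)))  # (1, 5, 5)
```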

I then load and run the model on Android, using ByteBuffers to hold the inputs and outputs. I can provide this code if requested, but I don't suspect it is the problem, as I use it in other working projects. I'm confident this is a Conv3D issue, because I have also built a Conv2D dummy network using the exact same build procedure and runtime environment, and it runs without crashing.

Other info / logs The full android backtrace during the call to runForMultipleInputsOutputs():

A/DEBUG: *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
A/DEBUG: Build fingerprint: 'samsung/dream2ltexx/dream2lte:9/PPR1.180610.011/G955FXXS5DSI1:user/release-keys'
A/DEBUG: Revision: '10'
A/DEBUG: ABI: 'arm'
A/DEBUG: pid: 6774, tid: 6817, name: Thread-2  >>> com.segmentation.qussegserviceNVW <<<
A/DEBUG: signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
A/DEBUG:     r0  00000000  r1  00001aa1  r2  00000006  r3  00000008
A/DEBUG:     r4  00001a76  r5  00001aa1  r6  b7628eac  r7  0000010c
A/DEBUG:     r8  b7629014  r9  b7628fa0  r10 b762903c  r11 e4095c70
A/DEBUG:     ip  b7628e48  sp  b7628e98  lr  e6c73e71  pc  e6c6ae62
A/DEBUG: backtrace:
A/DEBUG:     #00 pc 0001ce62  /system/lib/libc.so (abort+58)
A/DEBUG:     #01 pc 002181cd  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #02 pc 002213fd  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #03 pc 00225795  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #04 pc 0021fb63  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #05 pc 00335dcd  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #06 pc 00338313  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so
A/DEBUG:     #07 pc 0020925b  /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/lib/arm/libtensorflowlite_flex_jni.so (Java_org_tensorflow_lite_NativeInterpreterWrapper_run+26)
A/DEBUG:     #08 pc 00415879  /system/lib/libart.so (art_quick_generic_jni_trampoline+40)
A/DEBUG:     #09 pc 00411375  /system/lib/libart.so (art_quick_invoke_stub_internal+68)
A/DEBUG:     #10 pc 003ea57b  /system/lib/libart.so (art_quick_invoke_static_stub+222)
A/DEBUG:     #11 pc 000a1627  /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+154)
A/DEBUG:     #12 pc 001e88c9  /system/lib/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+236)
A/DEBUG:     #13 pc 001e33b7  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+814)
A/DEBUG:     #14 pc 003e60af  /system/lib/libart.so (MterpInvokeStatic+130)
A/DEBUG:     #15 pc 00404294  /system/lib/libart.so (ExecuteMterpImpl+14612)
A/DEBUG:     #16 pc 001aa16c  /dev/ashmem/dalvik-classes.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk_6774_6774 (deleted) (org.tensorflow.lite.NativeInterpreterWrapper.run+164)
A/DEBUG:     #17 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #18 pc 001cc219  /system/lib/libart.so (art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*, art::JValue*)+152)
A/DEBUG:     #19 pc 001e339f  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+790)
A/DEBUG:     #20 pc 003e50d3  /system/lib/libart.so (MterpInvokeVirtual+442)
A/DEBUG:     #21 pc 00404114  /system/lib/libart.so (ExecuteMterpImpl+14228)
A/DEBUG:     #22 pc 001a9962  /dev/ashmem/dalvik-classes.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk_6774_6774 (deleted) (org.tensorflow.lite.Interpreter.runForMultipleInputsOutputs+10)
A/DEBUG:     #23 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #24 pc 001cc15f  /system/lib/libart.so (art::interpreter::EnterInterpreterFromEntryPoint(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*)+82)
A/DEBUG:     #25 pc 003d8bb9  /system/lib/libart.so (artQuickToInterpreterBridge+880)
A/DEBUG:     #26 pc 004158ff  /system/lib/libart.so (art_quick_to_interpreter_bridge+30)
A/DEBUG:     #27 pc 0001c0fd  /dev/ashmem/dalvik-jit-code-cache_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.TensorFlowSegmentRunner$SegNetRunner.segmentChunk+492)
A/DEBUG:     #28 pc 004113bb  /system/lib/libart.so (art_quick_osr_stub+42)
A/DEBUG:     #29 pc 0024d8a9  /system/lib/libart.so (art::jit::Jit::MaybeDoOnStackReplacement(art::Thread*, art::ArtMethod*, unsigned int, int, art::JValue*)+1388)
A/DEBUG:     #30 pc 003e9aab  /system/lib/libart.so (MterpMaybeDoOnStackReplacement+86)
A/DEBUG:     #31 pc 00410bf4  /system/lib/libart.so (ExecuteMterpImpl+66164)
A/DEBUG:     #32 pc 0002e7b8  /dev/ashmem/dalvik-classes2.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk!classes2.dex_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.TensorFlowSegmentRunner$SegNetRunner.segmentChunk+76)
A/DEBUG:     #33 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #34 pc 001cc219  /system/lib/libart.so (art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*, art::JValue*)+152)
A/DEBUG:     #35 pc 001e339f  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+790)
A/DEBUG:     #36 pc 003e5f61  /system/lib/libart.so (MterpInvokeDirect+196)
A/DEBUG:     #37 pc 00404214  /system/lib/libart.so (ExecuteMterpImpl+14484)
A/DEBUG:     #38 pc 0002e750  /dev/ashmem/dalvik-classes2.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk!classes2.dex_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.TensorFlowSegmentRunner$SegNetRunner.access$100)
A/DEBUG:     #39 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #40 pc 001cc15f  /system/lib/libart.so (art::interpreter::EnterInterpreterFromEntryPoint(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*)+82)
A/DEBUG:     #41 pc 003d8bb9  /system/lib/libart.so (artQuickToInterpreterBridge+880)
A/DEBUG:     #42 pc 004158ff  /system/lib/libart.so (art_quick_to_interpreter_bridge+30)
A/DEBUG:     #43 pc 0001b5fd  /dev/ashmem/dalvik-jit-code-cache_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.TensorFlowSegmentRunner.segmentFrame+604)
A/DEBUG:     #44 pc 00411375  /system/lib/libart.so (art_quick_invoke_stub_internal+68)
A/DEBUG:     #45 pc 003ea479  /system/lib/libart.so (art_quick_invoke_stub+224)
A/DEBUG:     #46 pc 000a1615  /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136)
A/DEBUG:     #47 pc 001e88c9  /system/lib/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+236)
A/DEBUG:     #48 pc 001e33b7  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+814)
A/DEBUG:     #49 pc 003e50d3  /system/lib/libart.so (MterpInvokeVirtual+442)
A/DEBUG:     #50 pc 00404114  /system/lib/libart.so (ExecuteMterpImpl+14228)
A/DEBUG:     #51 pc 0002edb4  /dev/ashmem/dalvik-classes2.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk!classes2.dex_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.TensorFlowSegmentRunner.segmentCine+40)
A/DEBUG:     #52 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #53 pc 001cc219  /system/lib/libart.so (art::interpreter::ArtInterpreterToInterpreterBridge(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*, art::JValue*)+152)
A/DEBUG:     #54 pc 001e339f  /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+790)
A/DEBUG:     #55 pc 003e5ca3  /system/lib/libart.so (MterpInvokeInterface+1010)
A/DEBUG:     #56 pc 00404314  /system/lib/libart.so (ExecuteMterpImpl+14740)
A/DEBUG:     #57 pc 00028310  /dev/ashmem/dalvik-classes2.dex extracted in memory from /data/app/com.segmentation.qussegserviceNVW-52gMpws9Z9fl4lOXs6XvOw==/base.apk!classes2.dex_6774_6774 (deleted) (com.segmentation.qussegserviceNVW.CinePlayerActivity$3$1.run+20)
A/DEBUG:     #58 pc 001c7b33  /system/lib/libart.so (_ZN3art11interpreterL7ExecuteEPNS_6ThreadERKNS_20CodeItemDataAccessorERNS_11ShadowFrameENS_6JValueEb.llvm.2760711098+378)
A/DEBUG:     #59 pc 001cc15f  /system/lib/libart.so (art::interpreter::EnterInterpreterFromEntryPoint(art::Thread*, art::CodeItemDataAccessor const&, art::ShadowFrame*)+82)
A/DEBUG:     #60 pc 003d8bb9  /system/lib/libart.so (artQuickToInterpreterBridge+880)
A/DEBUG:     #61 pc 004158ff  /system/lib/libart.so (art_quick_to_interpreter_bridge+30)
A/DEBUG:     #62 pc 00411375  /system/lib/libart.so (art_quick_invoke_stub_internal+68)
A/DEBUG:     #63 pc 003ea479  /system/lib/libart.so (art_quick_invoke_stub+224)
A/DEBUG:     #64 pc 000a1615  /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136)
A/DEBUG:     #65 pc 0034b0c5  /system/lib/libart.so (art::(anonymous namespace)::InvokeWithArgArray(art::ScopedObjectAccessAlreadyRunnable const&, art::ArtMethod*, art::(anonymous namespace)::ArgArray*, art::JValue*, char const*)+52)
A/DEBUG:     #66 pc 0034be1d  /system/lib/libart.so (art::InvokeVirtualOrInterfaceWithJValues(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, _jmethodID*, jvalue*)+320)
A/DEBUG:     #67 pc 0036d203  /system/lib/libart.so (art::Thread::CreateCallback(void*)+866)
A/DEBUG:     #68 pc 00064899  /system/lib/libc.so (__pthread_start(void*)+140)
A/DEBUG:     #69 pc 0001e329  /system/lib/libc.so (__start_thread+24)
jdduke commented 4 years ago

Thanks for the report. Have you tried with a non-optimized build? That might help provide symbols for the crash stack.

Assigning to @miaout17 for further assistance.

Mr-Grieves commented 4 years ago

@jdduke: I just tried rebuilding my .aar without -c opt and it did not change anything in the crash stack.

Also, as a small update: this issue does not seem to be present when I build the test network using TensorFlow 2.0. Unfortunately, the actual Keras network I am trying to work with was trained using v1.13.1, so that does not solve my issue.

jdduke commented 4 years ago

I'm curious: if you change your conversion target ops to just

converter.target_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]

does that make any difference?

Mr-Grieves commented 4 years ago

@jdduke: Without SELECT_TF_OPS, my conversion script fails to create the tflite model. It gives the following errors:

2019-11-26 18:14:50.278227: E tensorflow/core/grappler/grappler_item_builder.cc:656] Init node conv/kernel/Assign doesn't exist in graph
Traceback (most recent call last):
  File "convert_conv3d", line 23, in <module>
    tflite_model = converter.convert()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/lite/python/lite.py", line 983, in convert
    **converter_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/lite/python/convert.py", line 449, in toco_convert_impl
    enable_mlir_converter=enable_mlir_converter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/lite/python/convert.py", line 200, in toco_convert_protos
    raise ConverterError("See console for info.\n%s\n%s\n" % (stdout, stderr))
tensorflow.lite.python.convert.ConverterError: See console for info.
2019-11-26 18:14:51.414259: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Conv3D
2019-11-26 18:14:51.414371: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 4 operators, 7 arrays (0 quantized)
2019-11-26 18:14:51.415422: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 4 operators, 7 arrays (0 quantized)
2019-11-26 18:14:51.415487: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] After general graph transformations pass 1: 2 operators, 5 arrays (0 quantized)
2019-11-26 18:14:51.415521: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Group bidirectional sequence lstm/rnn: 2 operators, 5 arrays (0 quantized)
2019-11-26 18:14:51.415532: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before dequantization graph transformations: 2 operators, 5 arrays (0 quantized)
2019-11-26 18:14:51.415576: I tensorflow/lite/toco/allocate_transient_arrays.cc:345] Total transient array allocated size: 0 bytes, theoretical optimal value: 0 bytes.
2019-11-26 18:14:51.415583: I tensorflow/lite/toco/toco_tooling.cc:454] Number of parameters: 1001
2019-11-26 18:14:51.416249: E tensorflow/lite/toco/toco_tooling.cc:481] We are continually in the process of adding support to TensorFlow Lite for more ops. It would be helpful if you could inform us of how this conversion went by opening a github issue at https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md
 and pasting the following:

Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ADD. Here is a list of operators for which you will need custom implementations: Conv3D.
jdduke commented 4 years ago

Ah, sorry, what I meant was to change the target ops to just:

converter.target_ops = [tf.lite.OpsSet.SELECT_TF_OPS]
Mr-Grieves commented 4 years ago

Oho, that appears to have solved it! Thank you for the suggestion.
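For completeness, the converter settings that resolved both the conversion error and the runtime SIGABRT would look like the fragment below (a sketch assuming the same TF 1.x converter API used in the original script; it requires TF 1.x and the conv3d.h5 file, so it is shown as a configuration fragment rather than a tested example):

```
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file("conv3d.h5")

# Route all ops through the TF select (flex) runtime: Conv3D has no TFLite
# builtin, and mixing TFLITE_BUILTINS with SELECT_TF_OPS triggered the
# SIGABRT reported in this thread.
converter.target_ops = [tf.lite.OpsSet.SELECT_TF_OPS]

tflite_model = converter.convert()
with open("conv3d.tflite", "wb") as f:
    f.write(tflite_model)
```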

tensorflowbutler commented 3 years ago

Hi there,

We are checking to see if you still need help with this issue, as you are using an older version of TensorFlow that has officially reached end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.

google-ml-butler[bot] commented 3 years ago

Are you satisfied with the resolution of your issue?