tensorflow / tflite-micro

Infrastructure to enable deployment of ML models to low-power resource-constrained embedded targets (including microcontrollers and digital signal processors).

Unexpected behaviour from ReduceMax #2336

Closed: cosw0t closed this issue 8 months ago

cosw0t commented 9 months ago

I have a small int8-quantized model in tflite format that I have been trying to run on TFLM, and I have been experiencing unexpected behaviour. The model contains a ReduceMax layer, and I have reason to believe that the memory management is incorrect / has a bug in it.

I cannot share the full model, but I am seeing the same behaviour on a smaller model made of just the ReduceMax layer, which I will share at the end of this post.

My observations on my test setup are the following:

- Running the test under MSAN reports a use-of-uninitialized-value error:

  ```
  SUMMARY: MemorySanitizer: use-of-uninitialized-value /proc/self/cwd/external/gemmlowp/fixedpoint/fixedpoint.h:342:26 in int gemmlowp::SaturatingRoundingDoublingHighMul(int, int) Exiting
  ```

- When I initialize the tensor arena passed to the `MicroInterpreter` with 0 (or with any value, for that matter), the uninitialized-memory error goes away (see the sketch after this list). However, 90% of my inferences are incorrect (according to my ground-truth labels). I also find it odd that the tensor arena would need to be initialized externally; is that expected?
- If I increase the size of the tensor arena ten-fold, my inferences become correct (according to the same labels as above). This suggests that the greedy memory planner - presumably acting greedily in the previous step by using a small buffer - leads to memory stomping of some kind. The above also happens with a call to `interpreter.Reset();` between `interpreter.Invoke()` and checking the output.
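
For reference, my setup is roughly the following (a sketch, not my exact test code; `kArenaSize`, `RunOnce`, and the single-op resolver are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Illustrative arena size; the real test sizes it for the model.
constexpr size_t kArenaSize = 5 * 1024;
uint8_t tensor_arena[kArenaSize];

TfLiteStatus RunOnce(const void* model_data) {
  const tflite::Model* model = tflite::GetModel(model_data);

  // The toy model only needs REDUCE_MAX registered.
  tflite::MicroMutableOpResolver<1> resolver;
  resolver.AddReduceMax();

  // Workaround under discussion: zeroing the arena makes the msan error go
  // away, but should not be necessary.
  std::memset(tensor_arena, 0, kArenaSize);

  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                       kArenaSize);
  TF_LITE_ENSURE_STATUS(interpreter.AllocateTensors());
  // ... fill interpreter.input(0)->data.int8 with the quantized input, then:
  return interpreter.Invoke();
}
```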

Again, unfortunately I cannot share the full model; however, I have prepared a small test with a tflite model containing just the ReduceMax layer, where I am observing the first two errors outlined above.

The model was created with:

```python
import tensorflow as tf

input_shape = (2, 2, 8)

input = tf.keras.layers.Input(shape=input_shape)
out = tf.keras.layers.GlobalMaxPool2D()(input)

model = tf.keras.models.Model(inputs=[input], outputs=out)
```

It was then converted to tflite and int8-quantized before export.

The branch https://github.com/cosw0t/tflite-micro/tree/reduce_max_bug_report contains the test under the "examples" folder:
- `tensorflow/lite/micro/examples/reduce_max`
- my test: [`tensorflow/lite/micro/examples/reduce_max/reduce_max_test.cc`](https://github.com/cosw0t/tflite-micro/blob/reduce_max_bug_report/tensorflow/lite/micro/examples/reduce_max/reduce_max_test.cc)
- model under [`tensorflow/lite/micro/examples/reduce_max/models/reduce_max.tflite`](https://github.com/cosw0t/tflite-micro/blob/reduce_max_bug_report/tensorflow/lite/micro/examples/reduce_max/models/reduce_max.tflite)
- run:
    - `bazel run tensorflow/lite/micro/examples/reduce_max:reduce_max_test` (will succeed or fail randomly)
    - `bazel run --config msan tensorflow/lite/micro/examples/reduce_max:reduce_max_test --compilation_mode dbg` to get *an* msan error, albeit not the same. Note that my full use case fails in the Add layer just before the ReduceMax:

```
==154970==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x7f7816daeabe in tflite::EvalMaxHelper(TfLiteContext*, TfLiteNode*, tflite::OpDataReduce*) /proc/self/cwd/tensorflow/lite/micro/kernels/reduce_common.cc:270:7
    #1 0x7f7816da8a62 in tflite::EvalMax(TfLiteContext*, TfLiteNode*) /proc/self/cwd/tensorflow/lite/micro/kernels/reduce.cc:52:10
    #2 0x7f7816eb428b in tflite::MicroInterpreterGraph::InvokeSubgraph(int) /proc/self/cwd/tensorflow/lite/micro/micro_interpreter_graph.cc:194:34
    #3 0x7f7816ec5508 in tflite::MicroInterpreter::Invoke() /proc/self/cwd/tensorflow/lite/micro/micro_interpreter.cc:294:17
    #4 0x55f6237ea473 in main /proc/self/cwd/tensorflow/lite/micro/examples/reduce_max/reduce_max_test.cc:39:3
    #5 0x7f7816567d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #6 0x7f7816567e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #7 0x55f623763814 in _start (/home/msanna/.cache/bazel/_bazel_msanna/29aa5494e2febf9342269ff9626570df/execroot/tflite_micro/bazel-out/k8-dbg/bin/tensorflow/lite/micro/examples/reduce_max/reduce_max_test+0x20814) (BuildId: b9b64951dbbe5c4400b87c3c88266d39bf691fc6)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /proc/self/cwd/tensorflow/lite/micro/kernels/reduce_common.cc:270:7 in tflite::EvalMaxHelper(TfLiteContext*, TfLiteNode*, tflite::OpDataReduce*) Exiting
```



I hope this helps identify the issue. Even if the problem turns out to be in my test case, it would be great if someone could help me pin it down.

Thanks in advance. 
rascani commented 9 months ago

Thank you for such a detailed bug report. We've recently encountered other similar bugs, so I don't find this terribly surprising. Many data structures are allocated from the arena or on the stack and only get sparsely populated. The two msan reports you provided made it easy to pin down two instances of these for Add & ReduceMax. I'll put up a PR to fix those. I'll also look into how we can address this more systematically.

> When I initialize the tensor arena passed to the `MicroInterpreter` with 0 (or with any value, for that matter), the uninitialized-memory error goes away. However, 90% of my inferences are incorrect (according to my ground-truth labels). I also find it odd that the tensor arena would need to be initialized externally; is that expected?

> If I increase the size of the tensor arena ten-fold, my inferences become correct (according to the same labels as above). This suggests that the greedy memory planner - presumably acting greedily in the previous step by using a small buffer - leads to memory stomping of some kind. The above also happens with a call to `interpreter.Reset();` between `interpreter.Invoke()` and checking the output.

I find this concerning and I don't have a good explanation for this yet. No, you shouldn't need to initialize the tensor arena to a known pattern. Are you sure you're providing a big enough arena in the first place?
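
A quick way to sanity-check that (a sketch, assuming a TFLM version that has `MicroInterpreter::arena_used_bytes()`, and reusing the names from the sketch above):

```cpp
#include "tensorflow/lite/micro/micro_log.h"

// After a successful AllocateTensors(), the interpreter can report how much
// of the arena the planner actually consumed; headroom near zero is a red flag.
if (interpreter.AllocateTensors() == kTfLiteOk) {
  MicroPrintf("arena used: %u of %u bytes",
              static_cast<unsigned>(interpreter.arena_used_bytes()),
              static_cast<unsigned>(kArenaSize));
}
```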

cosw0t commented 9 months ago

> I find this concerning and I don't have a good explanation for this yet. No, you shouldn't need to initialize the tensor arena to a known pattern. Are you sure you're providing a big enough arena in the first place?

Thanks for your reply. It occurred to me that the way I have presented the issue might be confusing.

My interpretation is that the only reason my code runs after initializing the tensor arena is that `op_data->input_zp` and `op_data->output_zp` now come from the same pool of memory that has been manually initialized. They will be equal (so no error no. 1) and there is no msan error either (no. 2). However, this value might still be wrong. In reality it should be whatever is in the tflite model, correct?

Maybe what comes later - the problem with a small/large-enough pool - is a kind of avalanche effect. It's probably wise to look into it again after the first two points are fixed. I am not 100% sure the arena that I allocate is big enough, but I trust that the runtime will tell me if it's not and fail earlier. Is this not the case?

Adding to that, I have just rebased onto https://github.com/tensorflow/tflite-micro/pull/2305 and I get another error, this time here: https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h#L52C4

The shapes are input: [4, 1, 49, 3] and output: [1, 1, 45, 30]. The input shape is wrong; it should be [1, 1, 49, 3].

This is a few layers before the Add and the ReduceMax, so I'm not sure it still relates to this bug.

Here's the stack:

```
libtensorflow_Slite_Smicro_Skernels_Slibmicro_Uops.so!tflite::MatchingDim(const tflite::RuntimeShape & shape1, int index1, const tflite::RuntimeShape & shape2, int index2) (\home\msanna\workspace\tflite-micro\tensorflow\lite\kernels\internal\types.h:266)
libtensorflow_Slite_Smicro_Skernels_Slibmicro_Uops.so!tflite::reference_integer_ops::ConvPerChannel(const tflite::ConvParams & params, const int32_t * output_multiplier, const int32_t * output_shift, const tflite::RuntimeShape & input_shape, const int8_t * input_data, const tflite::RuntimeShape & filter_shape, const int8_t * filter_data, const tflite::RuntimeShape & bias_shape, const int32_t * bias_data, const tflite::RuntimeShape & output_shape, int8_t * output_data) (\home\msanna\workspace\tflite-micro\tensorflow\lite\kernels\internal\reference\integer_ops\conv.h:52)
libtensorflow_Slite_Smicro_Skernels_Slibmicro_Uops.so!tflite::(anonymous namespace)::Eval(TfLiteContext * context, TfLiteNode * node) (\home\msanna\workspace\tflite-micro\tensorflow\lite\micro\kernels\conv.cc:116)
libtensorflow_Slite_Smicro_Slibmicro_Uinterpreter_Ugraph.so!tflite::MicroInterpreterGraph::InvokeSubgraph(tflite::MicroInterpreterGraph * const this, int subgraph_idx) (\home\msanna\workspace\tflite-micro\tensorflow\lite\micro\micro_interpreter_graph.cc:194)
libtensorflow_Slite_Smicro_Slibmicro_Uframework.so!tflite::MicroInterpreter::Invoke(tflite::MicroInterpreter * const this) (\home\msanna\workspace\tflite-micro\tensorflow\lite\micro\micro_interpreter.cc:294)
```

Let me know if you would like me to provide a toy example for this error too.

rascani commented 9 months ago

> My interpretation is that the only reason my code runs after initializing the tensor arena is that `op_data->input_zp` and `op_data->output_zp` now come from the same pool of memory that has been manually initialized. They will be equal (so no error no. 1) and there is no msan error either (no. 2). However, this value might still be wrong. In reality it should be whatever is in the tflite model, correct?

Correct. In #2337, I also properly initialized the `input_zp` and `output_zp`. However, in `EvalMaxHelper` those are really only used in the ENSURE clause to make sure we're using the same quantization params; they don't impact the ensuing math.
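
The shape of that fix (a sketch, not the exact diff from #2337) is to populate the fields during Prepare instead of leaving the arena-allocated op data as-is:

```cpp
// Sketch: inside the reduce kernel's Prepare, copy the quantization params
// out of the input/output tensors so Eval never reads uninitialized
// OpDataReduce fields.
op_data->input_zp = input->params.zero_point;
op_data->output_zp = output->params.zero_point;
```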

> Maybe what comes later - the problem with a small/large-enough pool - is a kind of avalanche effect. It's probably wise to look into it again after the first two points are fixed. I am not 100% sure the arena that I allocate is big enough, but I trust that the runtime will tell me if it's not and fail earlier. Is this not the case?

It should, though there are some edge cases that I've run into before. You can use the `RecordingMicroAllocator` to get a better idea of the arena usage.
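
For example (a sketch following the memory-management docs, reusing the same `model`, `resolver`, and arena as above):

```cpp
#include "tensorflow/lite/micro/recording_micro_interpreter.h"

// Drop-in replacement for MicroInterpreter that records how the arena is
// carved up per allocation category.
tflite::RecordingMicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
interpreter.AllocateTensors();
// Logs head/tail usage and per-type allocations.
interpreter.GetMicroAllocator().PrintAllocations();
```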

> Adding to that, I have just rebased onto https://github.com/tensorflow/tflite-micro/pull/2305 and I get another error, this time here: main/tensorflow/lite/kernels/internal/reference/integer_ops/conv.h#L52C4

Make sure you were actually using PR #2337, not #2305. For the conv issue, I'd suggest visualizing the model with a tool like Netron and inspecting the shape in the TFLite flatbuffer. That would help narrow it down between an issue in the model creation vs. the TFLM runtime.

rascani commented 9 months ago

An additional question: are you placing the tensor arena on the stack for your actual model? I'd generally expect most models to require a big enough arena that it might overflow the stack.

cosw0t commented 9 months ago

> Make sure you were actually using PR #2337, not #2305. For the conv issue, I'd suggest visualizing the model with a tool like Netron and inspecting the shape in the TFLite flatbuffer. That would help narrow it down between an issue in the model creation vs. the TFLM runtime.

The shape is correctly reported by Netron as [1, 1, 49, 3]. I have rebased on the latest main, which at the time of writing includes #2305 and #2335. I was surprised to see that these two introduced a new issue. I have created this branch to reproduce: https://github.com/cosw0t/tflite-micro/tree/bug_report/conv2d_main

Following your advice to consider only #2337, I've dropped these two commits and pulled #2337 instead. Now if I leave the tensor arena uninitialized there is no error related to junk values for `op_data->input_zp != op_data->output_zp` in ReduceMax (yay).

However, msan still complains about the Add at the same spot. I think this is coming from a Conv2D with dilation [2,2], and I have managed to reproduce it here: https://github.com/cosw0t/tflite-micro/tree/bug_report/conv2d_2337

Every other observation about correctness of results depending on tensor arena size remains unchanged.

> An additional question: are you placing the tensor arena on the stack for your actual model? I'd generally expect most models to require a big enough arena that it might overflow the stack.

Yes, it's on the stack, and the runtime does complain if the amount is not enough.

I have 3 separate issues now, and they show up depending on which branch of TFLM I rebase my sample onto.

Technically the ReduceMax is fixed, so I thought I'd migrate the discussion about the Conv2D issue to #2338 with the same information I just reported. That way we also don't mix the discussion about bugs/asserts with the one about observable behaviour.

Once the Conv2D works I will be happy to report what else is broken, including following your suggestion to run with `RecordingMicroAllocator`.

Thanks!

rascani commented 9 months ago

> However, msan still complains about the Add at the same spot. I think this is coming from a Conv2D with dilation [2,2], and I have managed to reproduce it here: cosw0t/tflite-micro@bug_report/conv2d_2337

Yeah, it looks like I made an incorrect assumption that the uninitialized `ArithmeticParams` was causing the issue. When adding `-fsanitize-memory-track-origins`, it points to something uninitialized in the tensor_arena. That isn't really helpful though, as we overlay a bunch of stuff in there. I also tried initializing the `OpDataAdd` object, and that did not help either.

> Yes, it's on the stack, and the runtime does complain if the amount is not enough.

Be careful with the size of the arena on the stack. If you're just running on desktop Linux, each thread typically gets a large stack, and a 5K arena should fit okay. But when running on MCUs, you'll likely have significantly smaller stack sizes; a 4K stack is somewhat common. We typically mark tensor arenas as `static` so that they're placed in the BSS section. Notably, that also excludes them from MSAN analysis, as it only looks at the stack and heap.
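
The typical pattern looks like this (illustrative size):

```cpp
#include <cstddef>
#include <cstdint>

// Lives in .bss rather than on the stack; the alignment matches what the
// TFLM examples use.
constexpr size_t kArenaSize = 20 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];
```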

github-actions[bot] commented 8 months ago

"This issue is being marked as stale due to inactivity. Remove label or comment to prevent closure in 5 days."

github-actions[bot] commented 8 months ago

"This issue is being closed because it has been marked as stale for 5 days with no further activity."