Open tagunil opened 2 weeks ago
Hi, @tagunil
Thank you for bringing this issue to our attention and if possible could you please help us with Google colab notebook along with model to replicate the same behavior from our end to investigate this issue further from our end ?
Thank you for your cooperation and patience.
Hi @gaikwadrahul8, sure. I just don't want to clutter the issue with irrelevant details, so I'll make a simple reproducer and share the link with you.
@gaikwadrahul8, I've attached a simple Jupyter notebook reproducing the crash to the issue. https://github.com/user-attachments/files/16894947/reproducer.zip
BTW, I know that the "while" op is not quantizable as for now, but that's a separate issue. We might consider selective quantization, for example, but the segfault in the calibrator won't allow us doing that.
Hi, @tagunil
I apologize for the delayed response, I tried to run your jupyter notebook in Google colab with CPU
and GPU
runtime after this line converter.convert()
the colab session is crashing for an unknown reason for reference I've added runtime log here so did you face that issue while running your code in Google colab ?
Thank you for your cooperation and patience.
Hi @gaikwadrahul8,
I'm running it locally, but it crashes all the same, and it's the crash itself that I'm reporting here, because I've traced it to the root cause and it's a null pointer dereference inside the calibrator.
@tagunil, If you have a workaround to address this issue we welcome a pull request. Our team will review it carefully and merge it if it aligns with our coding standards.
Hi, @pkgoogle Please take look into this issue. Thank you
Hi @tagunil, forgive me if I'm incorrect but isn't:
@tf.function
def call(self, inputs):
windows = tf.shape(inputs)[0]
height = tf.shape(inputs)[1]
width = tf.shape(inputs)[2]
sums = tf.zeros((windows, width), dtype=self.dtype)
for y in tf.range(height):
sums += tf.squeeze(tf.slice(inputs, [0, y, 0], [-1, 1, -1]), axis=1)
return sums
equivalent to
tf.math.reduce_sum(inputs, axis=1, keep_dims=False)
If so, can you use that to get past this issue?
Hi @pkgoogle,
It's just a simplest reproducer for the crash I could think of. Of course, you are right in that particular case, but, unfortunately, our real model is much more complicated and cannot be implemented without a while op. We want to quantize it partially, but we still need to run the calibrator over the whole model for that.
I was able to replicate with your original code... colab gist here for convenience.
I attempted to do a an AI-Edge-Torch conversion instead to see if it could solve this:
import ai_edge_torch
import torch
import torch.nn as nn
import torch.nn.functional as F
WINDOWS = 1
HEIGHT = 32
WIDTH = 32
class CustomModel(nn.Module):
def __init__(self, input_shape):
super().__init__()
self.windows = input_shape[0]
self.height = input_shape[1]
self.width = input_shape[2]
self.dense = nn.Linear(self.width, 1)
def forward(self, x):
accumulation = torch.zeros(self.windows, self.width)
for i in torch.arange(self.height):
accumulation += torch.squeeze(x[:,i,:])
x = accumulation
x = self.dense(x)
x = F.sigmoid(x)
return x
model = CustomModel((WINDOWS, HEIGHT, WIDTH))
sample_input = (torch.randn(WINDOWS, HEIGHT, WIDTH),)
edge_model = ai_edge_torch.convert(model.eval(), sample_input)
edge_model.export("while.tflite")
This runs into a dynamic slicing issue with PyTorch Export, I'm guessing because most people would do the reduce sum operation instead. @tagunil, I would take a second look to see if perhaps you can vectorize your code and avoid a while loop/range. @vamsimanchala, can you please take a look? Thanks.
@pkgoogle, unfortunately, our use case is pretty specific and it would take a lot of effort to rework it in a vectorized way, even if possible. By the way, we've worked around the issue by adding a null pointer check to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L267. I would make a pull request already if the open source contribution policy in our company were not so uncertain, although we're trying to clarify it.
And yes, dynamic slicing is another pain point, but for now we work it around by using tf.slice directly instead of relying on Python slicing.
Hmm... a check is probably correct ... being optional doesn't mean in all cases it's not "loggable", but it does mean that that tensor sometimes is not allocated.
Edit: Actually just below, optional tensors are not loggable: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L283 ... so root solution will be different ... @tagunil, do you have a stack trace? Feel free to censor/change PII.
Hi @pkgoogle,
The problem with the optional tensor logic in this particular case is that the list of loggable tensors is created before the while op decides to deallocate one of its tensors (the deallocation happens at the prepare stage). So either we need to create the loggable list later, just between prepare and eval, or we need to add an additional check before trying to actually log the tensor. The latter option works at least in our case.
Of course, I can make a stack trace for you later today or tomorrow.
@pkgoogle, so here's a trace from a debug build of TF v2.17.0, up to the Python wrapper. Let me know if you want me to capture anything else.
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffe55db86c9 in tflite::optimize::calibration::MinMax::Update (this=0x55555a146838, values=0x0, tensor_size=1024, error_reporter=0x55555a1476b0) at tensorflow/lite/tools/optimize/calibration/calibration_logger.cc:36
36 tensorflow/lite/tools/optimize/calibration/calibration_logger.cc: No such file or directory.
(gdb) bt
#0 0x00007ffe55db86c9 in tflite::optimize::calibration::MinMax::Update (this=0x55555a146838, values=0x0,
tensor_size=1024, error_reporter=0x55555a1476b0)
at tensorflow/lite/tools/optimize/calibration/calibration_logger.cc:36
#1 0x00007ffe55d06928 in tflite::optimize::calibration::Logger::LogTensorValue (this=0x55555ad07bf0,
subgraph_index=0, tensor_index=10, tensor_values=0x0, tensor_size=1024, error_reporter=0x55555a1476b0)
at ./tensorflow/lite/tools/optimize/calibration/calibration_logger.h:56
#2 0x00007ffe55cfff61 in tflite::optimize::calibration::(anonymous namespace)::LoggingEval (context=0x55555ae0c578,
node=0x55555b28a1b0) at tensorflow/lite/tools/optimize/calibration/calibrator.cc:266
#3 0x00007ffe55f96c4e in tflite::Subgraph::OpInvoke (this=0x55555ae0c550, op_reg=..., node=0x55555b28a1b0)
at tensorflow/lite/core/subgraph.cc:1428
#4 0x00007ffe55f977fa in tflite::Subgraph::InvokeImpl (this=0x55555ae0c550) at tensorflow/lite/core/subgraph.cc:1724
#5 0x00007ffe55f971a9 in tflite::Subgraph::Invoke (this=0x55555ae0c550) at tensorflow/lite/core/subgraph.cc:1617
#6 0x00007ffe55e7a7c1 in tflite::impl::Interpreter::Invoke (this=0x55555b3f02c0)
at tensorflow/lite/core/interpreter.cc:242
#7 0x00007ffe551dc182 in tflite::calibration_wrapper::CalibrationWrapper::FeedTensor (this=0x55555aa6b990,
input_value=0x7fff3812ae80) at tensorflow/lite/python/optimize/calibration_wrapper.cc:499
#8 0x00007ffe551a0340 in pybind11_init__pywrap_tensorflow_lite_calibration_wrapper(pybind11::module_&)::$_7::operator()(tflite::calibration_wrapper::CalibrationWrapper&, pybind11::handle&) const (this=0x55555b050578, self=...,
input_value=...) at tensorflow/lite/python/optimize/calibration_wrapper_pybind11.cc:77
Thanks @tagunil, my current thinking is to remove the tensor from the loggable collection when we deallocate to keep the state consistent, that or reorder things (but that may cause further butterflies).
Well, @pkgoogle, you obviously know better. I'm just saying that a simple null pointer check before logging the tensor also seem to work.
1. System information
2. Code
reproducer.zip
3. Failure after conversion
Segmentation fault (signal 11) during conversion
5. (optional) Any other info / logs
The "while" operation does a check if an output tensor of the body subgraph is the same as the corresponding input tensor. If it's the case, it deallocates its own output tensor. The check is done at the prepare stage, so the affected tensor is already included in the "loggable_outputs" list by the calibrator. Then the calibrator tries to read the data from the deallocated tensor and segfaults. I've debugged it up to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L267 and found that the
tensor.data.f == nullptr
. The check in question was introduced between 2.13 and 2.14, so it might be considered a regression: https://github.com/tensorflow/tensorflow/commit/7d49fd431ee5cebbb76eda88bc17e48921e10c85