tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Calibrator segfaults trying to log the "while" operation #75140

Open tagunil opened 2 weeks ago

tagunil commented 2 weeks ago

1. System information

2. Code

reproducer.zip

3. Failure after conversion

Segmentation fault (signal 11) during conversion

5. (optional) Any other info / logs

The "while" operation does a check if an output tensor of the body subgraph is the same as the corresponding input tensor. If it's the case, it deallocates its own output tensor. The check is done at the prepare stage, so the affected tensor is already included in the "loggable_outputs" list by the calibrator. Then the calibrator tries to read the data from the deallocated tensor and segfaults. I've debugged it up to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L267 and found that the tensor.data.f == nullptr. The check in question was introduced between 2.13 and 2.14, so it might be considered a regression: https://github.com/tensorflow/tensorflow/commit/7d49fd431ee5cebbb76eda88bc17e48921e10c85

gaikwadrahul8 commented 2 weeks ago

Hi, @tagunil

Thank you for bringing this issue to our attention. If possible, could you please share a Google Colab notebook along with the model so that we can replicate the same behavior and investigate this issue further on our end?

Thank you for your cooperation and patience.

tagunil commented 2 weeks ago

Hi @gaikwadrahul8, sure. I just don't want to clutter the issue with irrelevant details, so I'll make a simple reproducer and share the link with you.

tagunil commented 2 weeks ago

@gaikwadrahul8, I've attached a simple Jupyter notebook reproducing the crash to the issue. https://github.com/user-attachments/files/16894947/reproducer.zip

tagunil commented 2 weeks ago

BTW, I know that the "while" op is not quantizable as of now, but that's a separate issue. We might consider selective quantization, for example, but the segfault in the calibrator won't allow us to do that.

gaikwadrahul8 commented 2 weeks ago

Hi, @tagunil

I apologize for the delayed response. I tried to run your Jupyter notebook in Google Colab with both CPU and GPU runtimes, and after the line converter.convert() the Colab session crashes for an unknown reason. For reference, I've added the runtime log here. Did you face that issue while running your code in Google Colab?

Thank you for your cooperation and patience.

tagunil commented 2 weeks ago

Hi @gaikwadrahul8,

I'm running it locally, and it crashes all the same. The crash itself is what I'm reporting here: I've traced it to the root cause, which is a null pointer dereference inside the calibrator.

gaikwadrahul8 commented 2 weeks ago

@tagunil, if you have a workaround that addresses this issue, we welcome a pull request. Our team will review it carefully and merge it if it aligns with our coding standards.

Hi @pkgoogle, please take a look into this issue. Thank you.

pkgoogle commented 2 weeks ago

Hi @tagunil, forgive me if I'm incorrect, but isn't:

    @tf.function
    def call(self, inputs):
        windows = tf.shape(inputs)[0]
        height = tf.shape(inputs)[1]
        width = tf.shape(inputs)[2]

        sums = tf.zeros((windows, width), dtype=self.dtype)

        for y in tf.range(height):
            sums += tf.squeeze(tf.slice(inputs, [0, y, 0], [-1, 1, -1]), axis=1)

        return sums

equivalent to

tf.math.reduce_sum(inputs, axis=1, keepdims=False)

If so, can you use that to get past this issue?
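For example, the call method could then shrink to something like this (just a sketch, assuming the loop really is only summing over the height axis):

    @tf.function
    def call(self, inputs):
        # Sum over the height axis in a single op; no Python loop, so no
        # "while" op is generated during conversion.
        return tf.math.reduce_sum(inputs, axis=1, keepdims=False)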

tagunil commented 2 weeks ago

Hi @pkgoogle,

It's just the simplest reproducer for the crash I could think of. Of course, you are right in that particular case, but unfortunately our real model is much more complicated and cannot be implemented without a while op. We want to quantize it partially, but we still need to run the calibrator over the whole model to do that.
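For reference, here is a self-contained sketch of the kind of setup that runs the calibrator over a model containing such a loop (the layer name, shapes, and representative dataset are illustrative, not copied from the attached reproducer):

import numpy as np
import tensorflow as tf

WINDOWS, HEIGHT, WIDTH = 1, 32, 32  # shapes matching those used elsewhere in this thread

class RowSum(tf.keras.layers.Layer):  # illustrative layer name
    @tf.function
    def call(self, inputs):
        sums = tf.zeros((tf.shape(inputs)[0], tf.shape(inputs)[2]), dtype=self.dtype)
        for y in tf.range(tf.shape(inputs)[1]):  # lowered to a TFLite "while" op
            sums += tf.squeeze(tf.slice(inputs, [0, y, 0], [-1, 1, -1]), axis=1)
        return sums

inputs = tf.keras.Input(shape=(HEIGHT, WIDTH), batch_size=WINDOWS)
model = tf.keras.Model(inputs, RowSum()(inputs))

def representative_dataset():
    for _ in range(8):
        yield [np.random.rand(WINDOWS, HEIGHT, WIDTH).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()  # calibration runs here; this is where the segfault is reported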

pkgoogle commented 1 week ago

I was able to replicate this with your original code... Colab gist here for convenience.

I attempted an AI-Edge-Torch conversion instead to see if it could solve this:

import ai_edge_torch
import torch
import torch.nn as nn
import torch.nn.functional as F 

WINDOWS = 1
HEIGHT = 32
WIDTH = 32

class CustomModel(nn.Module):
    def __init__(self, input_shape):
        super().__init__()
        self.windows = input_shape[0]
        self.height = input_shape[1]
        self.width = input_shape[2]
        self.dense = nn.Linear(self.width, 1)

    def forward(self, x):
        accumulation = torch.zeros(self.windows, self.width)
        for i in torch.arange(self.height):
            # Slice row i with a tensor index; this is likely the source of
            # the dynamic slicing issue noted below.
            accumulation += torch.squeeze(x[:, i, :])
        x = accumulation
        x = self.dense(x)
        x = F.sigmoid(x)
        return x

model = CustomModel((WINDOWS, HEIGHT, WIDTH))
sample_input = (torch.randn(WINDOWS, HEIGHT, WIDTH),)

edge_model = ai_edge_torch.convert(model.eval(), sample_input)
edge_model.export("while.tflite")

This runs into a dynamic slicing issue with PyTorch Export, I'm guessing because most people would do the reduce sum operation instead. @tagunil, I would take a second look to see if perhaps you can vectorize your code and avoid a while loop/range. @vamsimanchala, can you please take a look? Thanks.
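For what it's worth, a vectorized forward would look roughly like this (a sketch; it only replaces the per-row loop above with a single sum over the height dimension):

    def forward(self, x):
        x = x.sum(dim=1)        # accumulate over HEIGHT in one op, no loop or slicing
        x = self.dense(x)       # (WINDOWS, WIDTH) -> (WINDOWS, 1)
        return torch.sigmoid(x)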

tagunil commented 1 week ago

@pkgoogle, unfortunately, our use case is pretty specific and it would take a lot of effort to rework it in a vectorized way, even if that were possible. By the way, we've worked around the issue by adding a null pointer check at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L267. I would have made a pull request already if the open source contribution policy at our company were not so uncertain, although we're trying to clarify it.

tagunil commented 1 week ago

And yes, dynamic slicing is another pain point, but for now we work around it by using tf.slice directly instead of relying on Python slicing.
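Concretely, that workaround looks like this (a sketch; y is the loop variable from the call method quoted above):

# Python slicing with a tensor index, which hits the dynamic slicing pain point:
#   row = inputs[:, y, :]
# Explicit tf.slice used instead:
row = tf.squeeze(tf.slice(inputs, [0, y, 0], [-1, 1, -1]), axis=1)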

pkgoogle commented 1 week ago

Hmm... a check is probably correct... being optional doesn't mean the tensor is never "loggable", but it does mean that the tensor is sometimes not allocated.

Edit: Actually, just below, optional tensors are not loggable: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/optimize/calibration/calibrator.cc#L283 ... so the root solution will be different ... @tagunil, do you have a stack trace? Feel free to censor/change any PII.

tagunil commented 1 week ago

Hi @pkgoogle,

The problem with the optional tensor logic in this particular case is that the list of loggable tensors is created before the while op decides to deallocate one of its tensors (the deallocation happens at the prepare stage). So either we need to create the loggable list later, between prepare and eval, or we need to add an additional check before actually trying to log the tensor. The latter option works, at least in our case.

Of course, I can make a stack trace for you later today or tomorrow.

tagunil commented 1 week ago

@pkgoogle, so here's a trace from a debug build of TF v2.17.0, up to the Python wrapper. Let me know if you want me to capture anything else.

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffe55db86c9 in tflite::optimize::calibration::MinMax::Update (this=0x55555a146838, values=0x0, tensor_size=1024, error_reporter=0x55555a1476b0) at tensorflow/lite/tools/optimize/calibration/calibration_logger.cc:36
36      tensorflow/lite/tools/optimize/calibration/calibration_logger.cc: No such file or directory.
(gdb) bt
#0  0x00007ffe55db86c9 in tflite::optimize::calibration::MinMax::Update (this=0x55555a146838, values=0x0,
    tensor_size=1024, error_reporter=0x55555a1476b0)
    at tensorflow/lite/tools/optimize/calibration/calibration_logger.cc:36
#1  0x00007ffe55d06928 in tflite::optimize::calibration::Logger::LogTensorValue (this=0x55555ad07bf0,
    subgraph_index=0, tensor_index=10, tensor_values=0x0, tensor_size=1024, error_reporter=0x55555a1476b0)
    at ./tensorflow/lite/tools/optimize/calibration/calibration_logger.h:56
#2  0x00007ffe55cfff61 in tflite::optimize::calibration::(anonymous namespace)::LoggingEval (context=0x55555ae0c578,
    node=0x55555b28a1b0) at tensorflow/lite/tools/optimize/calibration/calibrator.cc:266
#3  0x00007ffe55f96c4e in tflite::Subgraph::OpInvoke (this=0x55555ae0c550, op_reg=..., node=0x55555b28a1b0)
    at tensorflow/lite/core/subgraph.cc:1428
#4  0x00007ffe55f977fa in tflite::Subgraph::InvokeImpl (this=0x55555ae0c550) at tensorflow/lite/core/subgraph.cc:1724
#5  0x00007ffe55f971a9 in tflite::Subgraph::Invoke (this=0x55555ae0c550) at tensorflow/lite/core/subgraph.cc:1617
#6  0x00007ffe55e7a7c1 in tflite::impl::Interpreter::Invoke (this=0x55555b3f02c0)
    at tensorflow/lite/core/interpreter.cc:242
#7  0x00007ffe551dc182 in tflite::calibration_wrapper::CalibrationWrapper::FeedTensor (this=0x55555aa6b990,
    input_value=0x7fff3812ae80) at tensorflow/lite/python/optimize/calibration_wrapper.cc:499
#8  0x00007ffe551a0340 in pybind11_init__pywrap_tensorflow_lite_calibration_wrapper(pybind11::module_&)::$_7::operator()(tflite::calibration_wrapper::CalibrationWrapper&, pybind11::handle&) const (this=0x55555b050578, self=...,
    input_value=...) at tensorflow/lite/python/optimize/calibration_wrapper_pybind11.cc:77
pkgoogle commented 1 week ago

Thanks @tagunil, my current thinking is to remove the tensor from the loggable collection when we deallocate it, to keep the state consistent; that, or reorder things (but that may cause further butterflies).

tagunil commented 1 week ago

Well, @pkgoogle, you obviously know better. I'm just saying that a simple null pointer check before logging the tensor also seems to work.