ni / grpc-labview

gRPC client and server support for LabVIEW
MIT License
85 stars 60 forks source link

LabVIEW crashes when serializing nested messages #384

Open jasonmreding opened 1 week ago

jasonmreding commented 1 week ago

See https://github.com/ni/measurement-plugin-labview/issues/594 for a description of the crash and how to reproduce. The proto API for the crash is here.

After doing a bit of debugging and trial and error, the crash only happens if the service reports results where the annotations field is populated in the response. Also, setting EfficientMessageCopy = FALSE in the ini file also fixes the crash. The crash is being triggered from the following DAbort in LV:

TH_REENTRANT size_t _FUNCC DSGetHandleSize(const void* h)
{
    size_t size;

    if (!GoodDSHandleOpt(h)) {
        gDAbortID(0xF50EFD7BUL);
        return 0;
    }
    size = BGetHandleSize(gMemZone, h);
    return size;
}

This seems to be the result of #335 and #352. I suspect #378 is also related to this.

AB#2864684

jasonmreding commented 1 week ago

@pratheekshasn @ni-sujain @CPattar-NI @yash-ni

I have been debugging this issue a bit, and I'm having a hard time understanding how the code works or what this change was intended to provide over the original implementation. I guess it was purely a performance improvement? From what I can see, this code path doesn't appear to be assigning data to the cluster from memory allocated by LV. Instead, it looks like it is copying data from the LVMessage itself via a google::protobuf::RepeatedField<char> buffer. If that's the case, then it's not surprising we are running into a crash, but maybe I just don't understand the code. It also appears the NumericArrayResize is allocating 8x the amount of memory necessary for a repeated nested message. However, that is happening in both code paths (with and without the toggle enabled) so maybe I'm misreading the code. While that is wasteful, it is not inherently incorrect from a functional standpoint. However, it might be indicative of other problems lingering in the code and code that isn't quite ready for release.

Given the number of issues already caused by this change and the complexity of the code, I'm wondering if we should just disable the feature or if you have any other insights/suggestions for how to fix the current, active code path? I guess this change was made as a performance optimization? If so, I don't have any background on how crucial the optimization was or what impact rolling it back might have on clients.