ni / grpc-labview

gRPC client and server support for LabVIEW
MIT License

Calling gRPC Client VIs from TestStand hang at "Wait for Occurrence" while using LabVIEW RTE #330

Open kt-jplotzke opened 12 months ago

kt-jplotzke commented 12 months ago

I've found a condition where calling gRPC client VIs from TestStand, while using the LabVIEW Run Time Engine adapter, causes the "Client Unary Call" VI to hang at "Wait on Occurrence". I can confirm that the gRPC request does get sent from the client prior to hanging and that the server does respond to the request. This issue does not occur when running using the LabVIEW Development Environment adapter.

Steps that I've used to create a reproducible example are below. I've also uploaded all of this code to https://github.com/kt-jplotzke/grpc-labview-teststand-hang.

I've created a small gRPC service with a single method, "WaitMs". This method simply wraps the LabVIEW Wait (ms) function. I then used grpc-labview 1.0.1.1 to generate the server and client code for this service.

service TestService {
    rpc WaitMs(WaitRequest) returns (WaitResponse);
}

message WaitRequest {
    uint64 msec_to_wait = 1;
}

message WaitResponse {
    uint64 timestamp = 1;
}

For the server, I simply wrapped the Wait (ms) function in the "Start Sync" function. This is the only gRPC method. I then built Run Service.vi as an EXE to allow this to run in the background. image

For the client, I simply wrapped my gRPC client function into a VI which calls Client Create, the gRPC client unary function, and then Client Destroy. I built this function into a Packed Project Library. image

I can confirm that both the native wrapper VI and the built PPL wrapper VI function as expected when executed from the LabVIEW Dev Environment. Both send a request and receive a response from the built gRPC server.

Then, I created a TestStand sequence with a single step -- to call this gRPC client wrapper VI. image

When this sequence is executed using the LabVIEW Development Environment as the LabVIEW adapter, the step executes successfully and returns the response. However, when this sequence is executed using the LabVIEW Run Time Engine, the gRPC step hangs indefinitely. By adding debug code to the Client Unary Call.vim VI, I have been able to prove that the hang is in Wait on Occurrence.vi. When I implement the patch mentioned in #193 (to connect the "timeout (ms)" control to the "Wait on Occurrence" VI), the call no longer hangs in TestStand, but an error (-1004) is always generated and no data is returned.

I have also proven that the gRPC request is sent out by the client and that a response is sent back from the server. It appears that the client is either not triggering the occurrence or not receiving the response successfully. However, given that this call is 100% successful when running using the LabVIEW development environment adapter, I feel like it is more likely the former than the latter.

I also believe that this is not related to the specific issue described in #193 as that issue refers to a potential race condition when responses are received quickly. In my case, I am passing in a 1000 msec delay into my gRPC function to purposely slow down the response.

This is currently a blocking issue for me. Please let me know any other information you need to help reproduce / help diagnose. Thank you!

AB#2598217

AndrewHeim commented 12 months ago

@kt-jplotzke - You mentioned in the other thread that you were using local loopback. Are you running local loopback in this case as well, or a device across the network?

If it's loopback, does the problem replicate when the client is run from a different device?

kt-jplotzke commented 12 months ago

@AndrewHeim - Yes, I initially tested just running the server on the same system as the TestStand client using the local loopback connection. I just tested running the gRPC server on a different host and I get the same results, regardless of using localhost or a separate machine.

kt-jplotzke commented 12 months ago

As another data point, I modified the Client Unary Call.vim VI to both wire the timeout to Wait on Occurrence.vi and to ignore the error out of Wait on Occurrence.vi: image

When I build a new PPL with this modification, the call does succeed from TestStand using the LabVIEW RTE adapter -- and returns the correct data from the gRPC server. However, the call takes the full timeout period. The occurrence never fires; instead, the wait expires after the given timeout. Even so, the CompleteClientUnaryCall2 DLL function call that follows Wait on Occurrence does return valid data.

This makes me think that this issue is specifically related to the DLL not firing the occurrence successfully.
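To make that reasoning concrete, here is a minimal C++ model of the symptom, not the grpc-labview implementation. It assumes each runtime owns its own occurrence; `Occurrence`, `PendingCall`, `CallOutcome` and `completeCall` are hypothetical names invented for this sketch. If the response is stored but the signal goes to a different occurrence than the one being waited on, the wait times out even though the data is already there.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

// Hypothetical stand-in for a LabVIEW occurrence owned by one runtime instance.
struct Occurrence {
    std::mutex m;
    std::condition_variable cv;
    bool fired = false;

    void fire() {
        { std::lock_guard<std::mutex> lock(m); fired = true; }
        cv.notify_all();
    }

    // Returns true if the occurrence fired before the timeout elapsed.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m);
        return cv.wait_for(lock, timeout, [this] { return fired; });
    }
};

// Hypothetical slot where the DLL would park the server's response.
struct PendingCall {
    std::mutex m;
    bool hasResponse = false;
    std::string response;
};

struct CallOutcome {
    bool occurrenceFired;
    bool hasResponse;
    std::string response;
};

// Stores a response, signals `signalled`, then waits on `waitedOn`.
// Passing two different occurrences models the DLL firing into the wrong runtime.
CallOutcome completeCall(Occurrence& waitedOn, Occurrence& signalled,
                         PendingCall& call, std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    {
        std::lock_guard<std::mutex> lock(call.m);
        call.response = "pong";  // placeholder payload
        call.hasResponse = true;
    }
    signalled.fire();
    const bool fired = waitedOn.wait(timeout);
    std::lock_guard<std::mutex> lock(call.m);
    return CallOutcome{fired, call.hasResponse, call.response};
}
```

With two distinct `Occurrence` objects the call reports a timeout yet still carries the response, which matches what the modified VI showed; with the same object on both sides the wait returns immediately.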

kt-jplotzke commented 12 months ago

I was able to determine the root cause of this. The issue is that TestStand can launch multiple versions of the LV RTE in the background -- for example, even though I'm using LV 2021 SP1 and TS 2021, TestStand uses both the LV2021 and LV2023 RTEs in the background. I can see this in a process explorer:

image

However, when a gRPC client or server is created in the grpc-labview DLL, the DLL dynamically gets a handle to the Occur() function in either LabVIEW.exe or lvrt.dll. It does this using the Windows GetModuleHandle function from here:

auto lvModule = GetModuleHandle("LabVIEW.exe");
if (lvModule == nullptr)
{
    lvModule = GetModuleHandle("lvffrt.dll");
}
if (lvModule == nullptr)
{
    lvModule = GetModuleHandle("lvrt.dll");
}

The issue is that GetModuleHandle() is not reliable when it is given only a file name and multiple modules with that name are loaded. In standard LabVIEW-only applications, only one lvrt.dll is loaded in the process. However, when using TestStand, multiple copies of lvrt.dll can be loaded (as seen above), so it is unpredictable which copy's handle the function returns. In my case, the DLL sometimes used the LV2023 RTE handle, which was not the RTE waiting for the occurrence to fire (as my code is using the LV2021 RTE).
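One way to check for this condition is to look for duplicate file names among the modules loaded in the process. The sketch below is only the matching logic, written portably; it assumes the list of full module paths has already been collected (on Windows this could be done, for example, with EnumProcessModules and GetModuleFileNameEx). `findModulesNamed` and the helper names are hypothetical, not part of grpc-labview.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Lower-cased copy, since Windows module names compare case-insensitively.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// File-name portion of a path; accepts both '\\' and '/' separators.
static std::string baseName(const std::string& path) {
    const auto pos = path.find_last_of("\\/");
    return pos == std::string::npos ? path : path.substr(pos + 1);
}

// Returns every loaded module path whose file name matches `name`.
// More than one result means a name-only lookup is ambiguous.
std::vector<std::string> findModulesNamed(const std::vector<std::string>& loadedPaths,
                                          const std::string& name) {
    assert(!name.empty());
    const std::string wanted = toLower(name);
    std::vector<std::string> matches;
    for (const auto& path : loadedPaths) {
        if (toLower(baseName(path)) == wanted) {
            matches.push_back(path);
        }
    }
    return matches;
}
```

If `findModulesNamed(paths, "lvrt.dll")` returns more than one entry, a lookup by bare name cannot be trusted to pick the runtime that is actually executing the calling VI.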

To resolve this, I built a new DLL which provides an exported function to define a path to a specific LVRT module. I call this function in my code prior to calling any other gRPC functions. To this function, I pass in the path of the LV2021 RTE:

image

Using this method, my issue is resolved and the occurrence fires every time. While maybe not the most elegant solution, I am submitting a pull request with these changes. As I see it, the 'Set LVRT Module Path' function does not need to be used by the generated code at all. However, it can be available to a user who runs into the same situation I am in.
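For readers who want the shape of that workaround without opening the DLL source, here is a rough sketch of the idea. It is not the code from the pull request: `RuntimeModuleResolver`, `setModulePath` and the injected `Lookup` callback are hypothetical names. The lookup is injected so the logic can be exercised without Windows; in a real build it could wrap GetModuleHandleA, which accepts a full path as well as a bare module name.

```cpp
#include <cassert>
#include <functional>
#include <initializer_list>
#include <string>
#include <utility>

using ModuleHandle = void*;
// Maps a module name or full path to a handle, or nullptr if not loaded.
using Lookup = std::function<ModuleHandle(const std::string&)>;

class RuntimeModuleResolver {
public:
    explicit RuntimeModuleResolver(Lookup lookup) : lookup_(std::move(lookup)) {
        assert(lookup_ != nullptr);
    }

    // Pin resolution to one specific runtime; call before any other gRPC call.
    void setModulePath(const std::string& path) { overridePath_ = path; }

    ModuleHandle resolve() const {
        if (!overridePath_.empty()) {
            // Deliberately no fallback: a wrong path should fail visibly
            // rather than silently bind to another runtime.
            return lookup_(overridePath_);
        }
        // Default order, same as the name-only lookup quoted earlier in the thread.
        for (const char* name : {"LabVIEW.exe", "lvffrt.dll", "lvrt.dll"}) {
            if (ModuleHandle handle = lookup_(name)) {
                return handle;
            }
        }
        return nullptr;
    }

private:
    Lookup lookup_;
    std::string overridePath_;
};
```

Leaving the override empty preserves the existing behaviour, so only callers that host more than one runtime in a single process need to set a path.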

AndrewHeim commented 12 months ago

Good work!

So as-is, is this library not thread-safe? (Thinking out loud as I work through this...) Well, sort of.

If I had two clients on different ports... as long as I had one instance of LabVIEW, it would be ok as they would reference the same DLL.

If I had two separate exes, each would have its own execution environment and each would reference the DLL, and it would be ok. At least as long as they're pointed at the same version, that is. I can't say beyond that.

Interesting that to me this seems to be mostly specific to running multiple runtime engines (that will open different copies of the DLL) in the same application. Which we are unlikely to see outside of TestStand. Or... would they collide if they were in separate applications? I would need to do more homework to say for sure.

kt-jplotzke commented 12 months ago

I don't see them colliding if they are in different applications. Each copy of the gRPC DLL would have its own memory space. The call to GetModuleHandle() to find which lvrt.dll to use only returns modules in the currently executing process. So, each LabVIEW application would be in its own process and have its own copy of the gRPC DLL, each referring to the correct lvrt.dll version.

In my opinion, this edge case would only manifest itself in TestStand, where a single process can call multiple RTEs at once.

ShockHouse5 commented 10 months ago

This would likely be resolved by setting the "Version Independence" flag in TestStand for the LabVIEW RTE Adapter. Then TestStand will only use one LabVIEW Version for RTE (the latest you have installed), instead of multiple.

kt-jplotzke commented 10 months ago

@ShockHouse5 Potentially if only running from the TestStand development environment, but potentially not if running from a custom LabVIEW operator interface.

In my case, my custom OI is built with LabVIEW 2021, and thus uses that RTE. When running from an OI, the LabVIEW RTEs that are called by TestStand run under the OI's process. So, setting the "Version Independence" flag in my setup caused both the LV2021 RTE (for the OI) and the LV2023 RTE (for the version-independent TestStand calls) to be loaded under a single OI process, which still caused the DLL callback issues.

ShockHouse5 commented 10 months ago

@kt-jplotzke Yeah correct. Unless your EXE is also built with "Allow future versions to run this". If that is checked in your exe, and "Version Independence" in TestStand, they will always match because they will both use the latest version.

rdecarreau commented 9 months ago

Loose relation to #193 and #324

nischalks commented 9 months ago

Released as part of https://github.com/ni/grpc-labview/releases/tag/v1.2.1.1