Determining which tag read/write operation a warning in the log is associated with

bcurless1 commented 12 months ago

I've been troubleshooting some issues I'm having with a certain .ACD file while using FT LogixEcho (These issues don't appear to happen on a real PLC, but haven't tested it too much). And in doing so I enabled logging in my application and am seeing many warnings related to "An operation is already in flight!", "Size to set must be a positive number!", and "Tag has no data!". At this point I am wondering how do I determine which tag read/writes are causing this issue? I've got lots of threads hitting the controller simultaneously, so my log messages aren't very clear. Is there a specific log message to look out for? I haven't seen anything but there is a lot going on when I change the level to spew... Also wanted to add, thanks for all the hard work on this. Everything has been working out for me really well so far, and I'm hoping that this issue is just isolated to the emulator as it seems to be.

Update: It seems that I can pretty easily fix all the "An operation is already in flight" by filtering out distinct members before I queue up reads/writes. Definitely curious what is causing the other two warnings though and if there is reason for concern.

timyhac commented 12 months ago

1: How to determine which tag a certain log message is associated with

If you're just using the NativeImport library then you can associate find out which tag from the tag_id and use your lookup table to search for the relevant information. The API docs for the core library are here.

The primary package keeps track of the tag_id internally but unfortunately does not expose it. - the thinking was that we wanted to abstract away from the core library, and passing through a handle would break encapsulation. Logging does break this encapsulation however, but it is also very performance sensitive so we have a tradeoff.

Presumably you're using the [LibPlcTag.LogEvent APIhttps://github.com/libplctag/libplctag.NET/blob/master/src/libplctag/LibPlcTag.cs#L86) to log all events - but you can also set the DebugLevel for individual tags - so you could gradually turn the DebugLevel up for more and more tags until you start seeing the message. https://github.com/libplctag/libplctag/wiki/API#debugging https://github.com/libplctag/libplctag.NET/blob/master/src/libplctag/TagOfT.cs#L131

2: How to analyze/fix the warning messages

In terms of the Tag has no data warning - it looks like this would be associated with an error return code from the core library. So as long as that is being dealt with - presumably you'll be able to find this quite easily.

For the "Size to set must be a positive number" warning I'm not sure how to go about debugging this. Getting some more info on the details of the tag might illuminate the problem.

bcurless1 commented 12 months ago

Thanks for the quick response. I'm using the .Net Core wrapper so I don't think I can access the lookup table. However I think I was able to narrow in on the issue I was having. I had some code that was issuing parallel ReadAsync's to the same tag instance. The behavior I was seeing was either an ErrorTimeout (which I think also coincides with when libplctag logs out "An Operation is Already In Flight"), or more concerning, I'm seeing my calls to ReadAsync never actually return on occasion as well. I was able to quickly fix the issue by removing duplicate tags before I issue my parallel reads, but was wondering if this would be the expected behavior. I've only just started looking at the NativeTagWrapper, but it seems like it could be some race condition regarding the TCS, completer, and timeout. I can try and dig in some more to the library now that I have a test case that fails if you think this is unexpected behavior. Also here is my example that reproduces the issue for me. I'm pretty competent w/ git in general but haven't used github much. It seems that I don't have access to create a new branch, but let me know the best way to proceed on working on this issue if deemed relevant.

        public async Task ReadAsync_When_Called_On_SameTag_Simultaneously_Should_Not_Fail()
        {
            var myTag = new Tag<DintPlcMapper, int>()
            {
                //Any valid tag here should cause this issue
                Name = "PROGRAM:MD_Production._RCP_Prod.GluePointA.GlueSettle",
                Gateway = "127.0.0.2",
                Path = "1,0",
                PlcType = PlcType.ControlLogix,
                Protocol = Protocol.ab_eip,
                Timeout = TimeSpan.FromMilliseconds(1000),
            };

            await myTag.InitializeAsync();

            int repetitions = 100;

            for (int jj = 0; jj < repetitions; jj++)
            {
                Task.WhenAll(
                    myTag.ReadAsync(),
                    myTag.ReadAsync()
                    ).GetAwaiter().GetResult();
            }
        }

bcurless1 commented 12 months ago

I mistakenly clicked Close Issue thinking it was a drop down list...Once again, new to github.

timyhac commented 12 months ago

Given you're in an async context anyway, why use 'GetAwaiter().GetResult()`? It would be more appropriate to use await WhenAll(...)

timyhac commented 12 months ago

Alrighty - I've had a chance to set up the test case you proposed and it throws an ErrorTimeout on the second iteration.

Using ab_server as the device with: .\ab_server.exe "--plc=ControlLogix" --path=1,0 --tag="MyTag:DINT[1]"

[Fact]
public async Task ReadAsync_When_Called_On_SameTag_Simultaneously_Should_Not_Fail()
{
    var myTag = new Tag<DintPlcMapper, int>()
    {
        //Any valid tag here should cause this issue
        Name = "MyTag[0]",
        Gateway = "127.0.0.1",
        Path = "1,0",
        PlcType = PlcType.ControlLogix,
        Protocol = Protocol.ab_eip,
        Timeout = TimeSpan.FromMilliseconds(1000),
    };

    await myTag.InitializeAsync();

    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync()); // Throws here
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
    await Task.WhenAll(myTag.ReadAsync(), myTag.ReadAsync());
}

If you set a breakpoint and step through, it doesn't throw - so there is some kind of time/state based effect happening - you could be right that there is some race condition happening here.

I'm not sure if this is the expected behaviour or not - I guess the fact that it happens only on the second call probably indicates that it isn't.

bcurless1 commented 11 months ago

Thanks for confirming that you have the same issues. I've had a little bit of time to dig more. It appears that what happens is if you issue another read before an in progress one has finished, then the call to the native library plc_tag_read immediately returns with error code -39 signifying PLCTAG_ERR_BUSY. I see this coincide with the log warning of "An Operation is Already in Flight!". With this knowledge, we would know that the tag read will never return as the native library never actually issued the read, and thus will always time out in the .NET wrapper. Would the most appropriate thing to do in this case be to not push a new TaskCompletionSource on the ConcurrentStack, and try and return when the previous call on the stack returns? Does that make sense? I'm also getting another "ErrorNotFound" error now too randomly with this test case, but I haven't figured anything out on it yet.

kyle-github commented 11 months ago

There are two possible responses from the native lib when you try to do a read or write while an operation is already in flight. If you use the async/non-blocking mode, then you will (should!) get PLCTAG_ERR_BUSY. If you use multiple threads and one calls an operation with a wait time (synchronous mode), then the second thread tries, the second thread will wait for a mutex. I think. Now that I am thinking about this more, I'll need to double check what the code does.

bcurless1 commented 11 months ago

It appears that libplctag is doing what is expected, there are just some threading issues in the .NET wrapper ReadAsync() function. The two main issues that I see are that if an operation is already in flight, it seems that we should TryPeek on the stack of read tasks to see if there is still a read operation in progress. If it is, await that task, if there is not, you can assume a read just went through and you can just return. The second issue is around threading, it appears that locking around the critical section where a read task is created, and then finally pushed on the stack is required as well as the same lock being around the ReadTaskCompleter. This is because there is a chance that plc_tag_read is initiated before the read task is actually pushed onto the stack. LibPlcTag could have already signaled that the read was complete in the ReadTaskCompleter, and we would fail the TryPop and do nothing, resulting in "ErrorTimeout". I'll paste my rough fix below, let me know what you think.

        public async Task ReadAsync(CancellationToken token = default)
        {
            ThrowIfAlreadyDisposed();

            using (var cts = CancellationTokenSource.CreateLinkedTokenSource(token))
            {
                cts.CancelAfter(Timeout);
                await InitializeAsyncIfRequired(cts.Token);

                using (cts.Token.Register(() =>
                {
                    if (readTasks.TryPop(out var readTask))
                    {
                        Abort();

                        if (token.IsCancellationRequested)
                            readTask.SetCanceled();
                        else
                            readTask.SetException(new LibPlcTagException(Status.ErrorTimeout));
                    }
                }))
                {
                    Task<Status> waitTask = null;
                    lock (readLock)
                    {
                        var readTask = new TaskCompletionSource<Status>(TaskCreationOptions.RunContinuationsAsynchronously);
                        var rc = _native.plc_tag_read(nativeTagHandle, TIMEOUT_VALUE_THAT_INDICATES_ASYNC_OPERATION);
                        Console.WriteLine("Read returned " + rc);
                        if (rc == -39)
                        {
                            //A read is already in flight, wait for it to complete
                            if (readTasks.TryPeek(out var prevReadTask))
                            {
                                waitTask = prevReadTask.Task;
                            }
                        }
                        else
                        {
                            readTasks.Push(readTask);
                            waitTask = readTask.Task;
                        }
                    }
                    if (waitTask != null)
                    {
                        await waitTask;
                        ThrowIfStatusNotOk(waitTask.Result);
                    }
                }
            }
        }

        private void ReadTaskCompleter(object sender, TagEventArgs e)
        {
            lock (readLock)
            {
                if (readTasks.TryPop(out var readTask))
                {
                    switch (e.Status)
                    {
                        case Status.Pending:
                            // Do nothing, wait for another ReadCompleted callback
                            break;

                        default:
                            readTask?.SetResult(e.Status);
                            break;
                    }
                }
            }
        }

kyle-github commented 11 months ago

Is that overall algorithm racy?

Thread 1 starts an operation and does await.

Thread 2 tries and gets PLCTAG_ERR_BUSY from the native lib.

Thread 1's op completes.

Thread 2 checks and sees the existing op and tries to wait on the first op. But that op is done.

Is there a way to order all the different parts such that Thread 2 waits for something that never completes or for it to somehow starve?

timyhac commented 11 months ago

It does look like there is an issue but I think there is a question around what do we want the behaviour to be.

Should the second read task fail with ERR_BUSY or should it return the existing awaitable Task?

I'm thinking the first one because that mirrors the behaviour of libplctag core library. Its also makes it clear to the consumer what is happening rather than potentially indicating that a Read has been triggered (and returned really quickly).

What is the use case for reading a single tag twice concurrently?

bcurless1 commented 11 months ago

So the use case would be that I define a series of tags on the controller, so I only have a single copy of all tags on the controller. There is a single tag on the controller that multiple pieces of functionality may or may not have to touch at the same time, but both pieces of functionality operate independently, and thus don't know about each other.

To try and make it more concrete, say we had a tag representing a temperature of an oven that needed to be updated and displayed both on a maintenance screen for debugging purposes as well as be monitored by a separate routine to determine if the temperature is in an alarm state (out of range). The alarm monitoring and the maintenance screen don't and shouldn't have to know about each other, and may or may not be reading that same temperature tag at the same time on different threads. It seems that the only way to get around this would be to create duplicate tags, let me know if there is another way you would suggest here.

Trying to think through the scenarios, if you return ERR_BUSY, likely the calling code would just have to retry and issue another read, consuming more bandwidth, when there coincidentally already was a read queued up. Would there be a situation where this isn't relevant and you would want to force a separate read?

If you don't think returning the previously initiated read is a good idea, I think erroring when an operation is already in flight would definitely be an improvement too. Are there any other scenarios where the core library could fail to read and complete in the callback? If so then we would still want to handle the issue where ReadAsync() can fail to timeout if the plc_tag_read call fails to come through, and a context switch comes in after setting up the Timeout callback, but before the readtask is pushed onto the stack,

kyle-github commented 11 months ago

If you are going to consistently read the tags, perhaps it would be better to set them up as automatically read tags. Then you do not need to lock or issue read triggers at all. Just pull the data out of the tag when you want to.

timyhac commented 11 months ago

Trying to think through the scenarios, if you return ERR_BUSY, likely the calling code would just have to retry and issue another read, consuming more bandwidth, when there coincidentally already was a read queued up. Would there be a situation where this isn't relevant and you would want to force a separate read?

These are good points - I can recommend having a component that manages the communication to the PLC. This will allow you to coordinate the communication, cache values, manage retries, logging, behaviour when the PLC goes offline, etc. A key benefit for development is that you can also mock this as a unit for simulation/testing.

In terms of if there is ever a situation that you would want to have some guarantee that you have kicked off another read - I agree this isn't relevant to a SCADA-type application, but if you are attempting control of the machine you will have some processes that run in serial - read tag A, write tag B, read tag C, write tag D. In this situation you will want to know for sure that you have start one operation only after the previous one finished.

timyhac commented 11 months ago

If you don't think returning the previously initiated read is a good idea, I think erroring when an operation is already in flight would definitely be an improvement too. Are there any other scenarios where the core library could fail to read and complete in the callback? If so then we would still want to handle the issue where ReadAsync() can fail to timeout if the plc_tag_read call fails to come through, and a context switch comes in after setting up the Timeout callback, but before the readtask is pushed onto the stack,

Yes we'll have to look at the detail of these scenarios but in general I think we have to assume that libplctag is doing the correct thing - if not we escalate that to the core project and fix it there.

bcurless1 commented 11 months ago

How can I help with this? What do you think the path forward is?

timyhac commented 8 months ago

I think the first step would be to set up unit test cases that reflect these scenarios so that we can code towards them.

libplctag / libplctag.NET

Determining which tag read/write operation a warning in the log is associated with #346

1: How to determine which tag a certain log message is associated with

2: How to analyze/fix the warning messages