jcurl / RJCP.DLL.SerialPortStream

SerialPortStream is an independent implementation of System.IO.Ports.SerialPort and SerialStream for better reliability and maintainability. Default branch is 2.x and now has support for Mono with help of a C library.
Microsoft Public License
614 stars 196 forks source link

Upgrade from 2.2.2 to 2.3.1 makes my app hang #117

Closed maxbl4 closed 3 years ago

maxbl4 commented 3 years ago

It feel like your update to 2.3.1 made async read/write operation actually work in sync. The code like: var t = serialPort.ReadAsync(); // or WriteAsync() which previously did not block and returned a task, now blocks

jcurl commented 3 years ago

Thanks for the report. It shouldn't block.

Which runtime are you using? .Net 4.5 or .Net Core?

maxbl4 commented 3 years ago

I just switched to net5.0 and found one of the tests hanging. I could create short repro, but unfortunately it uses custom device, which it talks to

maxbl4 commented 3 years ago

The workaround is to call _ = Task.Run(MethodWhichIsNowSync); That forces new thread and do not block

maxbl4 commented 3 years ago

Tried reverting only SerialPortStream to 2.2.2 and all worked fine

jcurl commented 3 years ago

Thanks. The problem then is with the new task based implementation. Unfortunately, I don't have unit test cases yet, so it's a bit harder to debug. I assume what you write can be tested in only a few lines. I'll look into it as soon as I can, but I suggest you drop down to the older version for now until it is fixed.

jcurl commented 3 years ago

I'll need you to write a test case. The simplest that I have written doesn't show the problem you're describing.

584968046: ReadTask - Getting Task
584968046: ReadTask - Got Task, awaiting
584973046: WriteAsync
584973078: FlushAsync
584973078: await Read Task
584973156: ReadTask - Read 3 bytes
584973171: Done

You'll note there is no delay in getting the Task object, but it does properly await. The project attached is for .NET 5.0 and pulls in SerialPortStream 2.3.1.

while (expected > 0) {
  Console.WriteLine("{0}: ReadTask - Getting Task", Environment.TickCount);
  Task<int> rt = serialPortStreamRead.ReadAsync(buffer, 0, buffer.Length);
  Console.WriteLine("{0}: ReadTask - Got Task, awaiting", Environment.TickCount);
  int r = await rt;
  Console.WriteLine("{0}: ReadTask - Read {1} bytes", Environment.TickCount, r);
  expected -= r;
}

So you can see that getting the task is at time 584968046, and 0ms later it got it, and is now await for data. Now, the test case writes the data 5000ms later, which is reflected by the timestamp when data is received at 584973156 (i.e. there was a delay of 5110ms, which is OK).

Without a reproducible test case, I'm not sure what the problem is (or the fix).

Here's the test code: SPSTEST.zip

The test code here uses a driver that works under Windows 7-Windows 10 called "com0com". You should install that to test the application.

maxbl4 commented 3 years ago

I will have to investigate more. Simple repro also worked for me, so maybe it's some combination of my code with changes in yours

jcurl commented 3 years ago

Have you had a chance to investigate? I've set up the project now to use the SDK format, and tested within .NET Core App 3.1 and all tests pass there too.

maxbl4 commented 3 years ago

Hi, I did a quick fix in my code. Just added await Task.Yield() and it fixed the immediate issue. But I still can reproduce the original issue and here is a minimal repro as I could create: https://github.com/maxbl4/RfidDotNet/blob/serial_port_hang_repro/maxbl4.RfidDotNet.GenericSerial.Tests/SerialUnifiedTagStreamTests.cs Look at the Hang() test, I expect it to pass immediately, since it should return a Task on first await. But actually it will block. If I debug it, I see, that it works fine, loop is going through iterations and I see correct data from my device. If I add Task.Yield() it will not block and work as expected, the same work without Yield on SerialPortStream 2.2.2

PS that repro uses net5.0 and SerialPortStream 2.3.1

jcurl commented 3 years ago

I checked out the repository, and if I run it, it takes 3s and then the test finished with state "PASS". I assume this is because I don't have your specific RFID hardware?

Can you please confirm the line that is actually being blocked? You've got a few await methods. Can you instrument your test case with Console.WriteLine, the amount of data you're seeing? I'd like to see how the loops are executing.

Can you try the ReadAsync method with the CancelToken (just use the None). That's because the ReadAsync you've used wasn't implemented.

When it hangs, are you able to get a core dump? If you're not sure how to do that, I've got some code that can programmatically create a core dump; https://www.nuget.org/packages/RJCP.CodeQuality/ (but I've only tested this with .NET 4.0/4.5).

Else the alternative would be I can start instrumenting the binary with Console.WriteLine (I haven't figured out yet how to get tracing to work in .NET Core), and we can debug from there.

I've attached the most recent version (it's recompiled in debug mode, commit 238753e), if the problem is still reproducible with that. SerialPortStream.2.3.2-beta.20210425.1.nupkg.zip

Also, now that you've got a minimum test case, can you run the test case with .NET 4.5 or later, and tell me if it's reproducible or not?

maxbl4 commented 3 years ago

I could try to update repro to use some common serial device, how do you run your tests? I have some usb-ttl adapters and esp32, from there we could come up with test that runs same for both of us When you run my test it just fails inside loop, because it expects particular response

jcurl commented 3 years ago

The instructions on how I test are documented in master with the file BUILD.md. I run dotnet test.

If you can print out the commands as bytes and the expected responses, I might be able to hack a small test that listens for the request and gives the response, simulating your device.

The actual driver I use is com0com, which simulates a null modem, but I also test with real hardware, FTDI bring the most reliable that I've found. Tested with PL2303 RA on Win10 and Linux.

maxbl4 commented 3 years ago

My device is quite simple. I send a packet which you can see and save in my test, and device respond with packet of 7 to 250 bytes (usually 25-30). The first byte is length of whole packet, as you can see in the code, i read first byte, than read the rest of the packet. That's it. Should be easy to simulate. I will also check your testing instructions tomorrow, maybe will be able to repro without rfid reader

maxbl4 commented 3 years ago

Hi, I have updated repro to use com0com, still works same way as with hardware https://github.com/maxbl4/RfidDotNet/blob/serial_port_hang_repro/maxbl4.RfidDotNet.GenericSerial.Tests/SerialUnifiedTagStreamTests.cs

Looks like the issue is in ReadByte() method adding it make whole method execute synchronously

jcurl commented 3 years ago

Thanks for your analysis. I'll look further into it as soon as I can. I appreciate your patience and help, I can only work on this project on the "side".

jcurl commented 3 years ago

I think I recognize a problem. Let's see if it is "the" problem.

The code:

async Task Client(State state)
{
    //await Task.Yield();
    var port = new SerialPortStream("CNCB0", 57600, 8, Parity.None, StopBits.One)
    {
        ReadTimeout = 3000, WriteTimeout = 200
    };
    port.Open();
    port.DiscardInBuffer();
    port.DiscardOutBuffer();
    var buffer = new byte[] { 6,0,1,4,0,172,54};
    while (state.Sw.ElapsedMilliseconds < 5000)
    {
        await port.WriteAsync(buffer, 0, buffer.Length);
        var packetLength = port.ReadByte(); // adding this line make method blocking and not return Task
        var read = await port.ReadAsync(buffer);
        if (read > 0)
            state.SuccessCount++;
    }
}

Important to note per MSDN, that an Async method can still execute synchronously. So when you call Client, it gets to port.WriteAsync. This method is synchronous with the new implementation if it can write to an empty buffer immediately. This is by design for performance - we do not create a new thread if there is no reason to. That leads to the next method port.ReadByte() actually blocking, because it's still running in the context of the caller.

Now in 2.2.2, the underlying .NET implementation would always "invoke" the Read method on a new thread in the thread pool, even if it only would take a few nanoseconds to execute. The "default" implementation of ReadAsync provided by Stream.ReadAsync didn't know any better, until a "real" implementation was provided in 2.3.1.

This explains why in 2.2.2 you see that it doesn't block (because a thread is always created, this results in the port.ReadByte() being executed on the completion of the thread from the threadpool, via the OnCompleted through the state machine the compiler creates for you in the background.

Now in 2.3.1, it doesn't create a new thread. It just literally copies the data into the buffer. So the task returned by WriteAsync is already completed, there is no thread created, and so everything runs in the caller context, including ReadByte.

To fix this, you may not use any synchronous calls when using Task programming. This is why Task.Yield() fixes the problem.

If you agree, I think this is the root cause of your problem.

jcurl commented 3 years ago

I instrumented the code to print out the Environment.TickCount and also the managed thread id with Environment.CurrentManagedThreadId, and confirmed that for example, the test case starts with TID=14, and it finally exits after 5s and the Client always remains on TID=14. For example:

Test Name:  maxbl4.RfidDotNet.GenericSerial.Tests.SerialUnifiedTagStreamTests.Com0ComRepro
Test Outcome:   Failed
Result Message: Expected state.Sw.ElapsedMilliseconds to be less than 1000L, but found 5542L.
Result StandardOutput:  
171622265: TID=14: Com0ComRepro: Create Device
171622281: TID=14: Device: Enter
171622281: TID=14: Device: Open
171622296: TID=14: Device: DiscardInBuffer
171622296: TID=14: Device: DiscardOutBuffer
171622312: TID=14: Device: ReadAsync
171622312: TID=14: Com0ComRepro: Create Client
171622312: TID=14: Client: Enter
171622312: TID=14: Client: Open
171622312: TID=14: Client: DiscardInBuffer
171622312: TID=14: Client: DiscardOutBuffer
171622312: TID=14: Client: WriteAsync
171622312: TID=14: Client: WriteAsync Done
171622343: TID=13: Device: ReadAsync = 7
171622343: TID=13: Device: WriteAsync
171622343: TID=13: Device: WriteAsync Done
171622343: TID=13: Device: ReadAsync
171622437: TID=14: Client: ReadByte = 1
171622437: TID=14: Client: ReadAsync = 6
171622437: TID=14: Client: WriteAsync
171622437: TID=14: Client: WriteAsync Done
171622453: TID=17: Device: ReadAsync = 7
171622453: TID=17: Device: WriteAsync
171622453: TID=17: Device: WriteAsync Done
171622453: TID=17: Device: ReadAsync
171622468: TID=14: Client: ReadByte = 1
171622468: TID=14: Client: ReadAsync = 6
171622468: TID=14: Client: WriteAsync
171622468: TID=14: Client: WriteAsync Done
...
171627250: TID=15: Device: ReadAsync = 7
171627250: TID=15: Device: WriteAsync
171627250: TID=15: Device: WriteAsync Done
171627250: TID=15: Device: ReadAsync
171627265: TID=14: Client: ReadByte = 1
171627265: TID=14: Client: ReadAsync = 6
171627265: TID=14: Client: WriteAsync
171627265: TID=14: Client: WriteAsync Done
171627281: TID=13: Device: ReadAsync = 7
171627281: TID=13: Device: WriteAsync
171627281: TID=13: Device: WriteAsync Done
171627296: TID=14: Client: ReadByte = 1
171627296: TID=14: Client: ReadAsync = 6
171627296: TID=14: Com0ComRepro: Pausing

That shows my reasoning is correct. Notice that the Device is on a new TID, because the call to ReadAsync has no data available immediately, so it internally creates a new Task to wait for that trigger. That trigger then occurs on the threadpool.

maxbl4 commented 3 years ago

Hi, thanks for investigation. I have replaced ReadByte with ReadAsync in my code, and does nor block anymore. I'm still not fully convinced with your explanation, because it does not sound realistic for a loop of read/write running for several seconds to always complete synchronously, but i have my code working and I can't ask for more :)

jcurl commented 3 years ago

Sure. I learned Async from the book Async in C#. There's a section there called "Async Methods Are Synchronous Until Needed". I'll quote the chapter:

I said before that async methods are only asynchronous if they consume an asynchronous method with an await. Until that happens, they run in the thread that called them, just the same as a normal synchronous method. This sometimes has very real implications, especially when it’s possible for an entire chain of async methods to complete synchronously. Remember that the async method only pauses when it reaches the first await. Even then, it sometimes doesn’t need to, because sometimes the Task given to an await is already complete. A Task can be already complete in these situations: • It was created complete, by the Task.FromResult utility method, which we’ll explore further in Chapter 7 • It was returned by an async method that never reached an await • It ran a genuine asynchronous operation, but has now finished (perhaps because the current thread did something else before awaiting) • It was returned by an async method that reached an await, but the Task it awaited was also already complete Because of the last possibility, something interesting happens when you await a Task that’s already complete, deep in a chain of async methods. The entire chain is likely to complete synchronously. That’s because in a chain of async methods, the first await to be called is always the deepest one. The others are only reached after the deepest method has had a chance to return synchronously. You might wonder why you would use async in the first place if the first or second possibilities happened. If those methods were guaranteed to always return synchronously, you’d be right, and it would be more efficient to write synchronous code than to write async methods with no await. However, there are situations where methods would sometimes return synchronously. For example, a method that cached its results in memory could return synchronously when the result is available from the cache, but asynchronously when it needs to make a network request. You may also want to make methods return Task or Task to future-proof a codebase, when you know there’s a good chance you’d like to make those methods asynchronous at some point down the line.

You're seeing this because no "wait" is needed, so WriteAsync completes synchronously. There's nothing in the standard that says we have to create something on the thread pool.

Because you're using a "while" loop, this could take longer. The correct solution here is probably Task.Yield() as you've found, as you've seen the execution time is theoretically unbounded. My WriteAsync is guaranteed to be at most a copy of memory to a buffer and doesn't block.