jcurl / RJCP.DLL.SerialPortStream

SerialPortStream is an independent implementation of System.IO.Ports.SerialPort and SerialStream for better reliability and maintainability. Default branch is 2.x and now has support for Mono with help of a C library.
Microsoft Public License
628 stars 197 forks source link

Intermittent serial port issue - no data is written. #47

Closed BernicusMaximus closed 6 years ago

BernicusMaximus commented 6 years ago

Great library! Thanks for building this for the community. I'm using this in a project and am running into a very strange issue. We have a windows service running on a 64bit Windows 10 Enterprise machine that maintains an open SerialPortStream connection to a peripheral. At seemingly random intervals our calls to SerialPortStream.Write fail to send data over the COM port. The method call does not fail and no errors are returned. I've turned on diagnostic tracing for the SerialPortStream library and can see when it switches from writing data to failing to do so. I have attached a sample log below. Any ideas? Your insight is greatly appreciated.

LogSample.txt

Just a note, I do not have a command timeout set on this SerialPortStream.

jcurl commented 6 years ago

Interesting case and thank you for the logs. What serial device are you using? Do you have hardware flow control active? What USB are you using?

The last working case looks like:

IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: DoWriteEvent: WriteFile(996, 48220016, 8, ...) == False
    DateTime=2018-02-04T17:42:21.8490340Z
IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: ProcessWaitCommEvent: EV_TXEMPTY
    DateTime=2018-02-04T17:42:21.8490340Z
IO.Ports.SerialPortStream Verbose: 0 : COM5: CommEvent: EV_TXEMPTY
    DateTime=2018-02-04T17:42:21.8490340Z
IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: ProcessWriteEvent: 8 bytes
    DateTime=2018-02-04T17:42:21.8490340Z
IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: ProcessWriteEvent: TX-BUFFER empty
    DateTime=2018-02-04T17:42:21.8490340Z

The failing case however is

IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: DoWriteEvent: WriteFile(996, 48220024, 8, ...) == False
    DateTime=2018-02-04T17:42:28.6656425Z
IO.Ports.SerialPortStream Verbose: 0 : COM5: SerialThread: ProcessWriteEvent: 0 bytes
    DateTime=2018-02-04T17:42:29.1657180Z

It tells me that Windows didn't immediately complete the request (OK, this happens everywhere), but it returns with a timeout of about 500ms with no data sent.

result = UnsafeNativeMethods.GetOverlappedResult(m_ComPortHandle, ref writeOverlapped, out bytes, true);
if (!result) {
  ...
} else {
  ProcessWriteEvent(bytes);
}

So the logs are saying with ProcessWriteEvent: 0 bytes that the GetOverlappedResult is finished (result was true), however no bytes were sent (bytes=0 shown in logs), and we don't see the COMM_EVENT indicating TX_EMPTY.

If you're using a 16550A chipset (really ancient, probably not), I'd check if you have flow control active.

If you are using a USB chipset, can you try different drivers (note, not always the newest is the best, as I found out with some)? I've found FTDI to be the most reliable. Check the USB cables for quality (some long cables, cheap hubs cause problems). Try exchanging the USB/SER adapter if you have better luck.

My tests have included running for days automatic tests sending via the serial port using FTDI on testing external embedded devices.

BernicusMaximus commented 6 years ago

Thanks for the info! We're using an industrial computer by a company called Qotom. We have sent an email to them to determine which chipset they are using, but I don't think it's the 16550A.

We do not have hardware flow control active. We're connecting from this Qotom computer to another embedded single board computer from Eurotech. These devices are just communicating with a simple 3 line (TX,RX,GND) serial cable, very low-tech.

I too have been running the tests for days (weeks actually) and have had no issues in our lab, only in production. :)

Is there a way to get the actual count of bytes written without modifying your library too much? That way if I see zero bytes written a few times in a row I can try some alternative handling while we're trying to get to the root cause. The issue seems to be resolved when we restart our windows service.

Thanks again for the help. I will post again when I have more info from the PC manufacturer or I find a resolution.

jcurl commented 6 years ago

You could check the property BytesToWrite. If the value doesn't decrease between two reads then data isn't being sent. The BytesToWrite checks how much data is still in the memory buffer that hasn't been successfully sent plus any data that the serial driver itself reports as being present. The property will call UnsafeNativeMethods.ClearCommError(m_ComPortHandle, out cErr, out comStat); which may have the side effect also of clearing any (unreported) errors in the serial driver. Give it a try!

If it's an embedded device, I would expect that it would be a 16550A compatible directly built in to Intel's LPC bus. You can certainly check to see what drivers are installed. For example, I see my 16550A on my laptop (it's old) is:

But yours is a COM5 and unlikely to be a 16550A (that would be a COM1 or COM2, perhaps COM3 or COm4 with IRQ sharing).

It would also suggest that closing and opening the serial port somehow resolves the issue too.

BernicusMaximus commented 6 years ago

Thanks for the continued assistance. We finally got it to fail in our lab this morning after running for 5 days. We also managed to get a software serial port monitor by Eltima hooked up to it while it was in the failed state. We took a snapshot when it was in a failed state:

odincomportsession_screenshot_failed

and when it was in a working state:

odincomportsession_screenshot_working

Looks like its getting a STATUS_TIMEOUT during the IRP_MJ_WRITE call. It seems like the buffer is filling as you mentioned - if you notice the data length is 84488 during the write period. For whatever reason, the buffer didn't fill up beyond this amount upon subsequent calls to the IRP_MJ_WRITE calls.

We are still working with the hardware manufacturer to understand what we can do on the hardware side. I will post more info as I get it. Hopefully this will help anyone having a similar issue.

jcurl commented 6 years ago

The Eltima logs certainly match the logs in the SerialPortStream. I saw the same, that eventually we just get the Write is finished with zero bytes. The 84488 is what is in the local buffer which it attempts to write. The assumption in code is that if it's too big, the driver should only send a partial amount and say how much it actually wrote.

I remember when designing the original code, PL2303 chipsets had problems sending too much data.

You could limit the size of the writes to something lower such as 4096 bytes by modifying the function CommOverlappedIo.cs::DoWriteEvent (line 747).

However, in the original logs you sent, you were only sending about 8 bytes every few seconds, so it suggests that the serial port wasn't sending data for probably 20 minutes or so.

jcurl commented 6 years ago

Assuming this is a driver issue, no feedback for some time. Closing

BernicusMaximus commented 6 years ago

Sorry I went quite. I forgot to come back and give an update until today. We just closed this issue a few weeks ago. We determined that it was not a driver issue, but an IRQ sharing issue. This computer that we are using has 6 built-in serial ports. The first two serial ports (COM1 & COM2) had their own IRQ setup (IRQ3 and IRQ4 respectively) , but the last four serial ports (COM3 - COM6) were sharing an IRQ (IRQ7). We found this irregularity when inspecting port details in the device manager. We could never actually prove it, but we strongly suspected that there were IRQ collisions in the hardware which were causing unpredictable behavior in the Windows serial ports. We worked with the manufacturer to get a BIOS update such that each serial port would have its own IRQ. After updating machines in our lab we tested for about a month and no longer had the issue. We've fixed the machines in the field and they have been running without issue for several weeks. Every day that passes without issue gives us more confidence that this was truly the problem. Thank you again for all of the support. We are very glad we found your library!!