jcurl / RJCP.DLL.SerialPortStream

SerialPortStream is an independent implementation of System.IO.Ports.SerialPort and SerialStream for better reliability and maintainability. Default branch is 2.x and now has support for Mono with help of a C library.
Microsoft Public License
624 stars 197 forks source link

System.IO.IOException: DeviceError on Linux #78

Closed geversmann closed 5 years ago

geversmann commented 5 years ago

We are using

We open the ports once. Then a short time for 8 min all is running well. But then we get following errors at SerialPortStream.ReadExisting()/.Write():

System.IO.IOException: Device Error at RJCP.IO.Ports.SerialPortStream.ReadCheckDeviceError(Boolean immediate) in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.SerialPortStream\SerialPortStream.cs:line 630 at RJCP.IO.Ports.SerialPortStream.ReadExisting() in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.SerialPortStream\SerialPortStream.cs:line 932 at Brabender.Common.Hardware5.BrabSerialPort.SerialPort.ReadExisting() in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.BrabSerialPort\SerialPort.cs:line 502

or

System.IO.IOException: Device Error at RJCP.IO.Ports.SerialPortStream.WriteCheckDeviceError() in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.SerialPortStream\SerialPortStream.cs:line 1101 at RJCP.IO.Ports.SerialPortStream.WriteCheck(Byte[] buffer, Int32 offset, Int32 count) in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.SerialPortStream\SerialPortStream.cs:line 1091 at RJCP.IO.Ports.SerialPortStream.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.SerialPortStream\SerialPortStream.cs:line 1139 at Brabender.Common.Hardware5.BrabSerialPort.SerialPort.Write(Byte[] buffer, Int32 offset, Int32 count) in C:\EASE\MetaBridge_Core\Brabender.Common.Hardware5.BrabSerialPort\SerialPort.cs:line 397

When we enable tracing then the problem disappears.

So we have debug the problem by a remote session: image image image

jcurl commented 5 years ago

the device error indicates that the I/O thread has stopped due to an I/O error earlier. To debug this, you need to capture traces (preferably verbose, if not, then info) with the information as provided in the wiki. See Tracing. To get to the root cause, it may be necessary to recreate also with "strace" on your system, which shows the results of all system calls.

When providing information about the libnserial, did you compile it yourself, or use a .deb package I've already provided? If you compiled it yourself, the commit ID is useful.

geversmann commented 5 years ago

As mentioned, we have already enabled tracing. We do this in code as following because .NET Core doesn’t supports app.config anymore. image But if we run our application with enabled tracing we don’t get the error anymore.

So we have debugged the problem as already shown. In case of the problem we get an error result at the call “m_Dll.serial_read(m_Handle, (IntPtr)bo, length);” The error code is 5 and the message is “Read-End-Of-File reached”.

Update: When we enable tracing and starting remote debugging then the problem appears too. So we think it is a timing Problem. 1120c_RJCP.log

jcurl commented 5 years ago

I'll reopen this for now, just so I can have the time to look over the logs. The line of code returning the error is here:

events.c: static ssize_t internal_read(struct serialhandle handle, char buf, size_t count)

  readbytes = read(handle->fd, buf, count);
  if (readbytes == 0) {
    serial_seterror(handle, ERRMSG_SERIALREADEOF);
    errno = EIO;
return -1;

So what is happening, the linux driver is returning zero, which should not be possible. The basis of the software is to use the select() system call to know if the read has data available. That it returns saying that data is available, the read() call should never return zero.

Posix defines in this case that zero is an end-of-file condition, so any further reads is expected to return zero because the driver indicates the remote side is closed.

You mention a possible race condition. It normally shouldn't be possible, but could you have two instances of the SerialPortStream accessing the same TTY device? That might show this problem.

An strace of the problem, along with a mapping to the .NET logs could help understand what is going on.

jcurl commented 5 years ago

Hello, are you able to track down if this is a device issue? To confirm, an strace is needed at the time of failure.

geversmann commented 5 years ago

The strace will take some time because our Linux expert is currently not present.

We use multithreaded access to SerialStreamPort, but to different ports. The accesses to each port are already locked at high level. By trial, I have locked each SerialStreamPort call with a static object, without success.

In the log there is always an “abortwatiforevent” before the error appears:

IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_waitforevent(RJCP.IO.Ports.Native.Unix.SafeSerialHandle, ReadEvent) == ReadEvent IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_abortwaitforevent(RJCP.IO.Ports.Native.Unix.SafeSerialHandle) = 0 IO.Ports.SerialPortStream Error: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: Error reading data; errno=5; description=Read End-Of-File reached

I think the abortwaitforevent comes from a Write call to port while the port receives characters. Do you think this can be a problem?

jcurl commented 5 years ago

The serial_abortwaitforevent shouldn't be the problem as:

So in the below we see that:

IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_waitforevent(RJCP.IO.Ports.Native.Unix.SafeSerialHandle, ReadWriteEvent) == WriteEvent
IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_write(RJCP.IO.Ports.Native.Unix.SafeSerialHandle, 140339872406384, 7) == 7
IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_abortwaitforevent(RJCP.IO.Ports.Native.Unix.SafeSerialHandle) = 0
IO.Ports.SerialPortStream Verbose: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: serial_waitforevent(RJCP.IO.Ports.Native.Unix.SafeSerialHandle, ReadEvent) == ReadEvent
IO.Ports.SerialPortStream Error: 0 : /dev/ttyBrab_Bb9ur5o6jsg10_0: ReadWriteThread: Error reading data; errno=5; description=Read End-Of-File reached

In the log above, it looks like serial_abortwaitforevent was called because of a call to write data (the only second reason would mean there is a thread stop message immediately before, which isn't present, so there's no error), and data arrived in the serial port shortly prior.

Code went to read what's on the serial port (via the posix read()), which the OS returned zero.

The code in events.c is pretty simple:

  int r = select(maxfd + 1,
                 &serreadfds,
                 (event & WRITEEVENT) ? &serwritefds : NULL,
                 NULL, ptval);
  if (r < 0) {
    ...
  } else if (r > 0) {
    serialevent_t resultevent = NOEVENT;
    if ((event & READEVENT) &&
        FD_ISSET(handle->fd, &serreadfds)) resultevent |= READEVENT;
    if ((event & WRITEEVENT) &&
        FD_ISSET(handle->fd, &serwritefds)) resultevent |= WRITEEVENT;
    if (FD_ISSET(handle->prfd, &serreadfds)) {
      pthread_mutex_lock(&(handle->abortmutex));
      // Something wrote to the pipe to abort the select()
      char buffer[128];
      while (read(handle->prfd, buffer, SIZEOF_ARRAY(buffer)) > 0) { }
      errno = 0;
      handle->abortpending = FALSE;
      pthread_mutex_unlock(&(handle->abortmutex));
    }
    return resultevent;
  }

so it's clear that the OS said there is something to read on the serial port. Even if something is written to the pipe, it would be cleared and ignored (select() may set multiple FDs, but we don't care, so long as it is one or more).

So this leads to a read(), which the OS said that the FD for the serial port is readable, and it is expected that a read() returns non-zero if there is data, or zero if there is an EOF condition, which occurred in the logs.

One point to note out, it's not the first time that a driver returns that select() indicates data is available when there isn't. In the file openserial.c I make the FD non-blocking, that this situation can be caught and logged, instead of entering a deadlock (waiting for data that will never arrive)

handle->fd = open(handle->device, O_RDWR | O_NOCTTY | O_NONBLOCK);

Posix says that a read() may return zero if there is simply no data available, so one can "poll" for data, but as the select() method indicated that data was available, we can only deduce the EOF condition was reached, or there is a bug in the serial driver.

The second reason a read() may return zero is if the argument for the number of bytes to read were zero. An strace would confirm this (it's not in the .NET logs on what the input was in case of error), but we can assume for now that the input was non-zero. If it were zero then the following code would be buggy:

if (m_Buffer.Serial.ReadBufferNotFull.WaitOne(0))
  rwevent |= SerialReadWriteEvent.ReadEvent;

I haven't been aware of any bugs in the buffer handling, the producer/consumer threads are locked and only the thread can put data in the buffer so it can never be made full after it's seen as not full as this part is single threaded.

So at this time, I suspect still it is a read from the driver returning 0, but an strace might provide more information.

jcurl commented 5 years ago

Not sure how to continue, but at this time I believe it is a driver error, as it isn't expected to get a read of zero. If you're able to get more information, please reopen this ticket and I'll look at it again.