nyholku / purejavacomm

Pure Java implementation of JavaComm SerialPort
http://www.sparetimelabs.com/purejavacomm/index.html
BSD 3-Clause "New" or "Revised" License
362 stars 146 forks source link

IOException due to JTermios.Impl.write returning -1 #103

Closed crosland closed 6 years ago

crosland commented 6 years ago

Purejavacomm is used by the jmri project http://jmri.sourceforge.net/

A few users are reporting IOExceptions after one Tx and one Rx message to USB serial ports have happened OK (i.e., communication is initially established). The USB device has NOT been unplugged or powered down.

A stack trace is included below.

The system is Windows 10 home with fall creator update.

I have cloned the purejavacomm repo to work on the source and modified the exception report to show the -1 return code from JTermios, and the JTermios errno value. That errno value (24) is only set in one place with the comment "pipe() not implemented on Windows backend".

I tried to debug further by enabling logging but the problem then seems to go away.

I have two other system (Win 10 home and Win10 enterprise) with the same jmri versions installed and the same USB hardware. Both of these work OK (so far).

There may be some randomness to the failure, as reported by users. In my case, it fails way more often than not.

My gut feeling is a timing issue somewhere.

Help!

Andrew Crosland

2017-12-26 19:27:57,525 serialdriver.SerialDriverAdapter INFO - COM5 port opened at 9600 baud, sees DTR: true RTS: true DSR: false CTS: false CD: false [main] 2017-12-26 19:27:57,526 sprog.SprogSystemConnectionMemo DEBUG - start command station queuing thread [main] 2017-12-26 19:27:57,777 sprog.SprogCommandStation DEBUG - Slot thread starts [SPROG slot thread] 2017-12-26 19:27:57,876 sprog.SprogCommandStation DEBUG - Sending packet O FF 00 FF [SPROG slot thread] 2017-12-26 19:27:57,912 sprog.SprogCommandStation DEBUG - Reply received: StatusDue = 0 ignoreReply = 0 P> [AWT-EventQueue-0] 2017-12-26 19:27:57,913 sprog.SprogCommandStation DEBUG - Get next packet to send [AWT-EventQueue-0] 2017-12-26 19:27:57,914 sprog.SprogCommandStation DEBUG - Sending packet O FF 00 FF [AWT-EventQueue-0] 2017-12-26 19:27:57,919 sprog.SprogTrafficController WARN - sendMessage: Exception: [AWT-EventQueue-0] java.io.IOException: An IO error has occured. jtermios result: -1 errno: 24 at purejavacomm.PureJavaSerialPort$1.write(PureJavaSerialPort.java:641) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97)

nyholku commented 6 years ago

Hi Andrew,

thank you for a great problem description.

I'm travelling so have little to time to look at this until after 7th of Jan.

I think the errno 24 is a red herring. Look at code near PureJavaSerialPort:1140, the method that sets the error code is used only once in Windows as the pipe() is not implemented. So what you are seeing is that error code (which is/was benevolent) after something else (that does not re-set the error code) goes wrong.

Looking at the stack trace t PureJavaSerialPort.java:641 I note that it looks like the IOException is thrown after a jtermios.JTermios.write a few lines above returns negative value which usually (but not always) is an indication that the port is closed. You should try to investigate that by adding some little debug output to that method to see what path it takes when it returns a negative value.

As this seems timing dependent try to add very little output or maybe better yet output to a memory buffer (Java has a memory stream) which you dump to console when the you hit that exception.

crosland commented 6 years ago

Hi,

Thanks for the response. I will try to follow your suggestions.

One user reports that everything is OK on two Core i5 systems but fails on two Xeon systems. My own experience is failure on AMD, success on Intel (one i5, one Pentium N4200). Maybe the different architectures behave differetnly at a low level. I also ensured I have the latest chipset drivers. The problem has also been seen intermittently on MacOS, but not, so far, on Linux.

nyholku commented 6 years ago

Strange, given the, by now long, history and (I think) wide usage of this library I would have thought that if there was a timing issue it would have surfaced before/for other applications as well.

Anything special about how you use the library?

Are you using (a lot of) threads?

Or are you using the events?

The suggestion about the architecture is thought provoking but the Windows API is pretty standard and well defined independent of low level details so I would not think this is the issue.

As you can reproduce this the best approach is to figure out exactly where it fails and take it from there.

wbr Kusti

crosland commented 6 years ago

I am not enough of a programmer to understand what is happening, but it's something to do with overlapped i/o. The failure is at line 508 of JTermios.Impl.java in the write() method:

if (!GetOverlappedResult(port.m_Comm, port.m_WrOVL, port.m_WrN, false))
    port.fail();

I think this is testing if the previous i/o operation has completed. Is that correct?

The part of the app that fails is generating messages as fast as possible so it may be possible that the next message is sent from the app before the previous i/o operation has completed.

nyholku commented 6 years ago

In there try printing result of GetLastError before port.fail() to see the Windows error code.

crosland commented 6 years ago

Ah, I already have, that's how I got the failing line number. I hacked the fail() code a little, and see:

fail() class jtermios.windows.JTermiosImpl line 509 Windows GetLastError()= 995, The I/O operation has been aborted because of either a thread exit or an application request.

nyholku commented 6 years ago

On 28 Dec 2017, at 19:34, Andrew Crosland notifications@github.com wrote:

Ah, I already have, that's how I got the failing line number. I hacked the fail() code a little, and see:

fail() class jtermios.windows.JTermiosImpl line 509 Windows GetLastError()= 995, The I/O operation has been aborted because of either a thread exit or an application request.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Ah, ok. I will think about this, right now I'm busy, I'll be back.

Do you know how the application uses PureJavaComm, is there a single thread that does a lot of calls (you say it works as fast as possible) to PJC or are there multiple threads doing stuff?

wbr Kusti

crosland commented 6 years ago

That's interesting. The two messages are sent from different threads. The 2nd message should not be sent until a reply to the 1st message is received from the hardware (by which time the first Tx i/o operation is complete). I will look more closely at the app code to see if there is a problem here.

blackfoxoh commented 6 years ago

Hi, quite a long time ago since the last mesage here. Are there any news on this? We are facing problems at the moment as well even I'm not sure if it's the same issue. I guess there is no fix so far? Or does someone have a fork with a potential fix I could test against our problem?

nyholku commented 6 years ago

I've not heard back from crosland so I've presumed that he has found the problem/solution elsewhere, so I guess I should close this issue and you should file a new one.

crosland commented 6 years ago

Sorry, I did not follow up. In the end there was no problem in purejavacomm. The error was due to one of my threads (the one using pjc) dying unexpectedly.

nyholku commented 6 years ago

Ok, thanks for reporting back.