nyholku / purejavacomm

Pure Java implementation of JavaComm SerialPort
http://www.sparetimelabs.com/purejavacomm/index.html
BSD 3-Clause "New" or "Revised" License
366 stars 146 forks source link

select() error 87 parameter is incorrect #113

Open swarwick opened 5 years ago

swarwick commented 5 years ago

(fragment deploy): 2018-11-20 10:35:33.118000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@64242d75) => 1 (fragment deploy): 2018-11-20 10:35:33.118000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[1]]) => 0 (fragment deploy): 2018-11-20 10:35:33.118000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@24c9f3b7) => 1 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < read(0,[1024,0x30,0x30,0x0D,0x0A,0x30,0x4D,0x0A,0x30...],1024) => 1 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[0]]) => 0 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@64242d75) => 1 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[1]]) => 0 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@24c9f3b7) => 1 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < read(0,[1024,0x2E,0x30,0x0D,0x0A,0x30,0x4D,0x0A,0x30...],1024) => 1 (fragment deploy): 2018-11-20 10:35:33.119000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[0]]) => 0 (fragment deploy): 2018-11-20 10:35:33.120000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@64242d75) => 1 (fragment deploy): 2018-11-20 10:35:33.120000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[1]]) => 0 (fragment deploy): 2018-11-20 10:35:33.121000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@24c9f3b7) => 1 (fragment deploy): 2018-11-20 10:35:33.121000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < read(0,[1024,0x30,0x30,0x0D,0x0A,0x30,0x4D,0x0A,0x30...],1024) => 1 (fragment deploy): 2018-11-20 10:35:33.121000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[0]]) => 0 (fragment deploy): 2018-11-20 10:35:33.122000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@64242d75) => 1 (fragment deploy): 2018-11-20 10:35:33.122000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[2]]) => 0 (fragment deploy): 2018-11-20 10:35:33.123000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@24c9f3b7) => 1 (fragment deploy): 2018-11-20 10:35:33.123000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < read(0,[1024,0x30,0x30,0x0D,0x0A,0x30,0x4D,0x0A,0x30...],1024) => 2 (fragment deploy): 2018-11-20 10:35:33.123000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[0]]) => 0 (fragment deploy): 2018-11-20 10:35:33.124000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@64242d75) => 1 (fragment deploy): 2018-11-20 10:35:33.124000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[1]]) => 0 (fragment deploy): 2018-11-20 10:35:33.124000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[0],[],[],jtermios.TimeVal@24c9f3b7) => 1 (fragment deploy): 2018-11-20 10:35:33.124000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < read(0,[1024,0x30,0x30,0x0D,0x0A,0x30,0x4D,0x0A,0x30...],1024) => 1 (fragment deploy): 2018-11-20 10:35:33.124000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < ioctl(0,1074030207,[[0]]) => 0 (fragment deploy): 2018-11-20 10:35:33.125000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: fail() class jtermios.windows.JTermiosImpl line 924, Windows GetLastError()= 87, The parameter is incorrect. (fragment deploy): 2018-11-20 10:35:33.125000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: jtermios.windows.JTermiosImpl$Fail (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: at jtermios.windows.JTermiosImpl$Port.fail(JTermiosImpl.java:100) (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: at jtermios.windows.JTermiosImpl.select(JTermiosImpl.java:924) (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: at jtermios.JTermios.select(JTermios.java:473) (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: at purejavacomm.PureJavaSerialPort$3.run(PureJavaSerialPort.java:1225) (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: at java.lang.Thread.run(Thread.java:748) (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < select(1,[],[],[],jtermios.TimeVal@64242d75) => -1 (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < errno() => 24 (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: select() or poll() returned -1, errno 24 (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < fcntl(0, 3, 0) => 131074 (fragment deploy): 2018-11-20 10:35:33.171000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < fcntl(0, 4, 131078) => 0 (fragment deploy): 2018-11-20 10:35:33.221000-0600 [22508:COM3] ERROR com.tibco.ep.dtm.stderr: log: < close(0) => 0

Connecting to COMM3 on a Windows 10 machine is randomly getting error 87 The parameter is incorrect error messages. Also there does not seem to be a good way to recover from this error as the line

n = select(m_FD + 1, rset, wset, null, m_HaveNudgePipe ? null : timeout); In PureJavaSerialPort.java is throwing a Fail exception which is cause but just kills the thread and no report of that error is given downstream to the end user of the library.

swarwick commented 5 years ago

Note this is running with

com.github.purejavacomm purejavacomm 1.0.2.RELEASE

but it seems like the code in this repo is still at 1.0.1-RELEASE and I do not see a branch of otherwise for 1.0.2-RELEASE so where is that code?

nyholku commented 5 years ago

Hi, thanks for reporting. As to code: There seems to be some mistake (me and Maven are not even at nodding terms, I only push stuff to Maven cause some people 'insist' on it). I will try to rectify this within two weeks.

As to the error I will investigate but these random failures on random machine are hard. I will check if I could just sweep this under the carpet by making the use of select tolerant to this problem, i.e. if select() throws on exception we could just re-try it.

swarwick commented 5 years ago

From looking around at the error description for other projects on windows it looks like the 'GetOverlappedResult' call is supposed to be called with a zero'd out structure. Its not clear to me that is happening via java but I have not dug into enough to know. Error 87 could also could just be flaky hardware, I respect the fact it's very difficult to try and debug this when its probably almost impossible to recreate as well.

If it helps at all the USB to serial dongle being used is

Sabrent USB 2.0 to Serial (9-Pin) DB-9 RS-232 Converter Cable, Prolific Chipset, Hexnuts, [Windows 10/8.1/8/7/VISTA/XP, Mac OS X 10.6 and Above] 2.5 Feet (CB-DB9P)

On a side note, I get the maven issues, we use maven and it drives me bonkers most days. Even with that though is there an issue with GitHub and pushing the code up so the rest of us can try and help debug thing when an issue arises?

swarwick commented 5 years ago

if select() throws on exception we could just re-try it. Maybe retry X number of times (X being a system variable you can set) and if that still fails then give some indication to the upper level application via an exception or callback/event?

Maybe inside the Runnable } catch (InterruptedException ie) { } finally { m_ThreadRunning = false; } add something like notifyOnReadWriteError(Exception e)

At least then the upper level application can spit something out in their own log and or try to do some issue management at runtime.

nyholku commented 5 years ago

The GetOverlappedResult should be zero by default. Also because the error is rare and random I don't think that is the issue. More likely I think this has to do with some sort of synchronisation issue i.e. the OVERLAPPED structure is re-used all the time and at some point OS changes that when PJC is not expecting that or vice versa and the Java side copy of the structure gets out of sync with the OS managed real memory and then it fails.

No there is no reason not to push stuff to GitHub once I have time, now very busy for two weeks.

swarwick commented 5 years ago

Are you ok with the idea of adding a new callback as well so the resulting application that uses the library can be notified if there is a unrecoverable failure?

nyholku commented 5 years ago

I might, though using the user would break compatibility with the original JavaComm (RIP) and RXTX. But if the user don't mind or does not use it then it would no be a problem for him/her.

swarwick commented 5 years ago

The only other way I see around this is to keep polling for ((PureJavaSerialPort)serialPort).isInternalThreadRunning() But that method even as named really should not be used, so there should be same way to alert the end user that an unrecoverable failure occurred. If you can think of a better way than a callback or using the existing event system that would be great as well. I am trying to build something that can be a little robust for even things like someone tripping over a cord and it gets unplugged, I want to be able to detect the error and close the serial port and retry X number of times to open it. Currently I only get the message in the log but no real other way I can see to check if the serial port has completely failed.

swarwick commented 5 years ago

I did find another way around this without using the internal method but it still doesnt sit well with me. I started to poll and check the getReceiveTimeout() method which calls "checkState" first and throws PureJavaIllegalStateException if the serial port is in a bad state. I feel like polling is not the best way to go with this though, but at this point I dont see any other alternative, especially if I only read and never write, since if there is no data I will not get a data ready event.

nyholku commented 5 years ago

We are talking about two different things here.

This Issue started from the error 87 which ideally should not happen as it is basically on indication of some sort of bug or logic error in the PJC code. And ideally we would find the error and fix it. As this is difficult given that I cannot reproduce this we discussed re-trying the select() and if the error is transient then this would solve the issue.

Now you brought into to discussion detecting and/or recovering from fatal errors via a callback or some such. This could added as I indicated. Weather or not it would solve the issue if re-trying the select does not work is another matter. And detecting the catastrophic failure is also problematic.

My first stab at this would be to add a re-try to the select() and do that internally in the Windows backend where the select() is implemented. So first find out where exactly the in jtermios.windows.JTermiosImpl.select() the error is first detected and modify the call to retry the failed call or call sequence there.

swarwick commented 5 years ago

yes you are correct sorry I did start to blend the two issues together. I will create another issue to track the fail/recovery by user scenario.

derikd commented 5 years ago

Not to go too far off topic….

I respect that JavaComm was designed to be backwards compatible with RxTx. That was a great boon to me back when I switched in 2010. But I haven’t heard a peep out of that older library in years. It isn’t because it didn’t have limitations. PJC is now its own standard. That said, lets not break the new standard either. Unless we must.

That said, maybe there is no harm in letting PJC evolve as needs require. If there is no other solution to an instability than writing a change to the interface, then I, for one, will not mind tweaking all my projects to handle one new function from time to time.

I am not really qualified to comment on this particular issue as I haven’t experienced it in real world applications with my app. But that probably has more to do with the nature of my application rather than some immunity to the problem.

— derik

On Nov 26, 2018, at 10:38 PM, nyholku notifications@github.com<mailto:notifications@github.com> wrote:

I might, though using the user would break compatibility with the original JavaComm (RIP) and RXTX. But if the user don't mind or does not use it then it would no be a problem for him/her.

nyholku commented 5 years ago

Thanks for the opinion, I appreciate it when people chime in.

On 27 Nov 2018, at 21.05, derikd notifications@github.com wrote:

Not to go too far off topic….

I respect that JavaComm was designed to be backwards compatible with RxTx. That was a great boon to me back when I switched in 2010. But I haven’t heard a peep out of that older library in years. It isn’t because it didn’t have limitations. PJC is now its own standard. That said, lets not break the new standard either. Unless we must.

That said, maybe there is no harm in letting PJC evolve as needs require. If there is no other solution to an instability than writing a change to the interface, then I, for one, will not mind tweaking all my projects to handle one new function from time to time.

I am not really qualified to comment on this particular issue as I haven’t experienced it in real world applications with my app. But that probably has more to do with the nature of my application rather than some immunity to the problem.

— derik

On Nov 26, 2018, at 10:38 PM, nyholku notifications@github.com<mailto:notifications@github.com> wrote:

I might, though using the user would break compatibility with the original JavaComm (RIP) and RXTX. But if the user don't mind or does not use it then it would no be a problem for him/her.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

swarwick commented 5 years ago

Could this be an issue that GetOverlappedResult is returning ERROR_IO_INCOMPLETE but the code is not checking for that? It looks like select checks for that but write doesnt (although it seems to later down the code but doesnt seem like it would reacht that code path).

nyholku commented 5 years ago

Could be, if you can debug that, it would be great!

swarwick commented 5 years ago

I am having a hard time debugging this myself as well (I found the 1.0.2 source jar and I am trying to use that but not 100% sure its valid). Also this is happening at a customer site and really I have to do old school debugging with logs as I do not have direct access to the hardware.

I was curious what you thought on line 838 if (!GetOverlappedResult(port.m_Comm, port.m_SelOVL, port.m_SelN, false)) if (GetLastError() != ERROR_IO_INCOMPLETE) port.fail(); After that it current checks if error code is ERROR_IO_INCOMPLETE, should it also check for ERROR_INVALID_PARAMETER? if (!GetOverlappedResult(port.m_Comm, port.m_SelOVL, port.m_SelN, false)) if (GetLastError() != ERROR_IO_INCOMPLETE && GetLastError() != ERROR_INVALID_PARAMETER) port.fail();

nyholku commented 5 years ago

I don't think that is correct approach although it might sweep the problem under the carpet. The issue here is that (most likely) the OVERLAPPED structure is still in use by previous invocation of whatever operation my code used it for last time. So I think we should just loop there

while (GetOverlappedResult() || GetLastError()== ERROR_IO_INCOMPLETE || GetLastError() == ERROR_INVALID_PARAMETER)

swarwick commented 5 years ago

I was given access to the box that has the error and was able to debug with live data. I got the error with existing code of:

if (!GetOverlappedResult(port.m_Comm, port.m_SelOVL, port.m_SelN, false)) { if (GetLastError() != ERROR_IO_INCOMPLETE) port.fail(); }

Changing the code to

if (!GetOverlappedResult(port.m_Comm, port.m_SelOVL, port.m_SelN, false)) { if (GetLastError() != ERROR_IO_INCOMPLETE && GetLastError() != ERROR_INVALID_PARAMETER) port.fail(); if (GetLastError() == ERROR_INVALID_PARAMETER) { log = log && log(1, "Invalid parameter event\n"); } }

Resolved the issue and let the software keep running and I did see log event for invalid parameter still but it skipped over and kept getting data.

swarwick commented 5 years ago

it is a rare event, it happened 2 times over 5-6 mins

swarwick commented 5 years ago

After awhile of running the machine blue screen of death appears so I am not sure if thats from this fix or some other issue but thought I would mention it.

swarwick commented 5 years ago

So interesting note to anyone else that might encounter this issue. The cable being used was a prolific USB to serial cable. We switched over to a FDTI chipset cable and the issue went away.

derikd commented 5 years ago

That is worth noting. Thank you. We use Prolific parts on all our devices. The big problem with Prolific is that for a long time they were cloned like crazy and the clones don’t work so well. At least, that is what Prolific is telling people. It could just be that their parts are not reliable. That said we have not had much trouble with the prolific parts EXCEPT for a short period of time when Prolific released a new driver though the windows update system that did not work. It was fixed in a few weeks, and in the mean time, the old driver could be used in a pinch. — derik

On Nov 30, 2018, at 8:07 AM, Steven Warwick notifications@github.com<mailto:notifications@github.com> wrote:

So interesting note to anyone else that might encounter this issue. The cable being used was a prolific USB to serial cable. We switched over to a FDTI chipset cable and the issue went away.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_nyholku_purejavacomm_issues_113-23issuecomment-2D443251937&d=DwMCaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=HrOydAaLpSJU88_Itq5KJ2E5OBsxkZkc4in7julQgSk&m=pkqK0QYFHO23vZjb8NO7tXKliv4OZIPfKuLL24JUvgo&s=h-El8bgGA5udbs2bLweIfYty6VKyTlZyrcoIzMCuiDk&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AOMXD6US3GLw1qjNPe0FTB2NZKf9n3WJks5u0VexgaJpZM4YroOa&d=DwMCaQ&c=euGZstcaTDllvimEN8b7jXrwqOf-v5A_CdpgnVfiiMM&r=HrOydAaLpSJU88_Itq5KJ2E5OBsxkZkc4in7julQgSk&m=pkqK0QYFHO23vZjb8NO7tXKliv4OZIPfKuLL24JUvgo&s=CYg_tIh2c1hDnn2qar_A4_wWUS6rzalS1FQd0eaKba0&e=.