Pete, I appreciate your feedback here. What I'm trying to obtain is a set of concrete steps that have some likelihood of reproducing the problem you've seen. Maybe @nyholku can help me out, but I'm not familiar with the situation and/or code that has been triggering the problem, and I'd be hard pressed to reproduce it from scratch.
@PeteSL perhaps with your extensive experience with all of the Windows APIs you could point out the flaw in JNA's usage of TLS as I've described it?
@twall This is not the place to delve further into the underlying JNA architecture.
I am not being argumentative. I have told you that this was concretely identified and it was resolved in a prior issue with PureJavaComm (please review recent issues that were resolved). I was always bothered by the statement "it is thread-local so it can't be overwritten" when we had someone definitively identify it being overwritten and implementing the LastErrorException code resolved this issue. As I have also said, you are not going to take steps X, Y, and Z to recreate it because it is a -timing- issue and may also be an issue with specific JVMs and JNA implementations.
The expectation by JNA that each Java thread = a native thread is an invalid assumption and I believe they are leaving themselves open for future issues with other projects. This project, however, is working because we are not depending on JNA's native thread-local storage, at least in the Windows implementation. Quite honestly, I don't have the time to pursue this with JNA right now and I don't think the PureJavaComm group is the place to pursue it either.
@twall Sorry I can't help you with a test case as this never happened to me.
The evidence for the problem, such as it is, is here:
https://github.com/nyholku/purejavacomm/issues/57
For me this is not 100% conclusive but I wanted to err on the safe side so we applied the try/catch fix as it looked safe enough.
My take on the situation:
We have two cases where this happened: the original issue reporter's and Pete's own experience (in hindsight the thread does not make the latter very clear, but I have presumed he saw this himself).
It is difficult to see how a PJC bug could cause a problem like this.
It is equally difficult to see what the problem on the JNA side is, if any.
A Windows anomaly (WaitCommEvent returning false while GetLastError returns 0) would be a nice way to push this into Microsoft's corner ;)
That is also hard to believe, though it is possible: maybe the proper way to use GetLastError is to check that it equals some specific value, whereas in PJC code I check (effectively via an else branch) that it is not equal to the expected value and report that as a failure. This would explain why no one else has seen (or publicly reported) this problem with the Win serial API.
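To make the two checking styles above concrete, here is a minimal Java sketch of how a FALSE return from an overlapped WaitCommEvent call might be classified. The class, method and enum names are hypothetical and are not PJC's actual code; the only Win32 facts assumed are the constants ERROR_SUCCESS (0) and ERROR_IO_PENDING (997).

```java
// Hypothetical sketch: two ways to interpret a FALSE return from an
// overlapped WaitCommEvent call, given the thread's last-error code.
final class WaitCommEventCheck {
    static final int ERROR_SUCCESS = 0;      // Win32: no error
    static final int ERROR_IO_PENDING = 997; // Win32: overlapped I/O still in progress

    enum Outcome { COMPLETED, PENDING, FAILED, NO_ERROR_CODE }

    // "Else is failure" style: anything other than the expected
    // ERROR_IO_PENDING is reported as a failure, so a last-error
    // value of 0 also lands in the FAILED branch.
    static Outcome elseIsFailure(boolean returned, int lastError) {
        if (returned) return Outcome.COMPLETED;
        if (lastError == ERROR_IO_PENDING) return Outcome.PENDING;
        return Outcome.FAILED;
    }

    // Variant that tests for specific values first, so the anomalous
    // "returned FALSE but last error is 0" case is reported on its own
    // instead of being folded into a generic failure.
    static Outcome explicitValues(boolean returned, int lastError) {
        if (returned) return Outcome.COMPLETED;
        if (lastError == ERROR_IO_PENDING) return Outcome.PENDING;
        if (lastError == ERROR_SUCCESS) return Outcome.NO_ERROR_CODE;
        return Outcome.FAILED;
    }
}
```

With the else-style check, `elseIsFailure(false, 0)` yields FAILED, while `explicitValues(false, 0)` yields NO_ERROR_CODE, which would at least make the anomaly distinguishable in logs.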
@nyholku Sorry this issue got off track. Hopefully we will get closure soon on it.
It may be that the native code is just fine, but... their assumption that a native thread = a Java thread is invalid, especially with the fork/join thread pools introduced in more recent JVMs. Also, many JVMs manage native threads internally for various reasons (for instance, early JVMs on various operating systems had a lot of problems with larger Java applications using up thread handles). That may also be what we were seeing and, to be honest, it is what I am leaning towards as the root cause at this point: the JVM reusing a native thread for multiple Java threads.
In any case, going to the LastErrorException puts us in the right place guaranteeing a Java thread-safe GetLastError(). One cleanup you may want to consider (and I will be happy to add if you want it) is to not call FormatMessage in fail() anymore but take the message from the LastErrorException message. Easy enough to change the ThreadLocal<int[]> to a ThreadLocal
Thanks again for your great work. We identified a bug in JNA (may be just the assumption that native = Java regarding threads) but also a very effective workaround made much easier by your placing the Windows API calls in their own class and encapsulating them. That was a very simple change from a native thread-safe GetLastError() to a Java thread-safe GetLastError() because of that isolation you did early on.
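A minimal, stdlib-only sketch of the pattern described above: the error code is captured by the exception at the failing call and then stored in a Java ThreadLocal, so a later read always comes from the same Java thread, and the stored exception's message could be reused in place of a separate FormatMessage call. NativeCallFailed is a hypothetical stand-in for JNA's LastErrorException (so the sketch runs without the JNA jar), and fakeNativeCall simulates a failing API call; neither is PJC's actual code.

```java
// Hypothetical stand-in for JNA's LastErrorException: the error code
// travels with the exception instead of living in separate storage.
final class NativeCallFailed extends RuntimeException {
    final int errorCode;

    NativeCallFailed(int errorCode) {
        super("[" + errorCode + "] native call failed");
        this.errorCode = errorCode;
    }
}

final class WinApiWrapper {
    // Per-Java-thread record of the last failure.
    private static final ThreadLocal<NativeCallFailed> LAST_ERROR = new ThreadLocal<>();

    // Simulated native call: "fails" with the given code when it is non-zero.
    static boolean fakeNativeCall(int simulatedError) {
        if (simulatedError != 0) throw new NativeCallFailed(simulatedError);
        return true;
    }

    // Wrapper that converts the exception back into a boolean result and
    // remembers the failure on the calling Java thread.
    static boolean call(int simulatedError) {
        try {
            boolean result = fakeNativeCall(simulatedError);
            LAST_ERROR.remove();   // success clears this thread's record
            return result;
        } catch (NativeCallFailed e) {
            LAST_ERROR.set(e);     // captured at the failure site, on this thread
            return false;
        }
    }

    // Java-thread-safe replacement for a native getLastError().
    static int getLastError() {
        NativeCallFailed e = LAST_ERROR.get();
        return e == null ? 0 : e.errorCode;
    }

    // Message taken from the stored exception rather than a FormatMessage call.
    static String getLastErrorMessage() {
        NativeCallFailed e = LAST_ERROR.get();
        return e == null ? "" : e.getMessage();
    }
}
```

Storing the whole exception rather than an int keeps the code and its message together, at the cost of holding one exception object per thread.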
@twall Tim, one thing to consider (and this may very well have been the case) is that a Java thread does not have to equal a native thread. In the early Linux JVMs from Sun, for instance, using native threads was experimental because of some thread handle issues. In current JVMs, there are Fork/Join pools with which an author can forcibly share Java threads. And nowhere in the JVM architecture requirements is there a mandate that native = Java for threading. In fact, the exact opposite has always been in place (the JVM implementation is independent of the underlying OS and can manage threading independently of the OS). I know you think the current Oracle JVM for Windows has a one-to-one correspondence, but maybe not. And the JNA folks can't assume every JVM will have a one-to-one thread match.
I go back to my original statement in issue #57 that something is overwriting the "thread-local" variable which tells me it is not Java thread local. I fully believe that to ensure platform independence (a JNA stated intent), the native code needs to be using a Java ThreadLocal variable or passing out a structure on the stack that contains any variables that need to be thread-local to the Java threads. Java ThreadLocal variables are relatively high overhead but are the only things guaranteed to be Java thread-local.
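As a Java-level analogy for the distinction above, the sketch below forces one interleaving with latches: thread A records an error, thread B records a success, then A reads back. A single shared slot loses A's value, while a Java ThreadLocal keeps it. This is only an illustration of shared versus per-Java-thread storage; it does not reproduce JNA's native TLS, and it does not show that any JVM shares native threads between Java threads. SlotDemo and its fields are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;

final class SlotDemo {
    // One slot shared by every thread: storage not tied to the writer.
    static volatile int sharedSlot;
    // One slot per Java thread.
    static final ThreadLocal<Integer> perThread = ThreadLocal.withInitial(() -> 0);

    // Returns {value A reads from sharedSlot, value A reads from perThread}.
    static int[] run() {
        CountDownLatch aWrote = new CountDownLatch(1);
        CountDownLatch bWrote = new CountDownLatch(1);
        int[] seen = new int[2];

        Thread a = new Thread(() -> {
            sharedSlot = 5;        // A's call "fails" with error 5
            perThread.set(5);
            aWrote.countDown();
            awaitQuietly(bWrote);  // B runs between A's failure and A's read
            seen[0] = sharedSlot;
            seen[1] = perThread.get();
        });
        Thread b = new Thread(() -> {
            awaitQuietly(aWrote);
            sharedSlot = 0;        // B's call succeeds and stores 0
            perThread.set(0);
            bWrote.countDown();
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

`SlotDemo.run()` returns `{0, 5}`: the shared slot was overwritten by B, while A's ThreadLocal value survived.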
We have our workaround for PureJavaComm that is probably the way it should have been implemented in the first place, had he realized the Native.getLastError() was not guaranteed to be Java thread-safe (again, may be native thread-safe but that doesn't make it Java thread-safe) so this issue will end up being closed for PureJavaComm.
If you want to take this thought process over to the JNA group, please feel free to. Remember, nowhere in the JVM architectural specs does it say a Java thread = a native thread and, in fact, the JVM architects continue to move more and more to JVMs managing their own threads. These are all things to keep in mind if you take this back to the JNA folks and these caveats apply to all OS platforms, not just Windows.
On Sep 14, 2015, at 6:28 AM, PeteSL notifications@github.com wrote:
@twall Tim, one thing to consider (and this may very well have been the case) is that a Java thread does not have to equal a native thread. In the early Linux JVMs from Sun, for instance, using native threads was experimental because of some thread handle issues. In current JVMs, there are Fork/Join pools which an author can forcibly share Java threads. And no place in JVM architecture requirements is there a mandate that native=Java for threading. In fact, the exact opposite has always been in place (the JVM implementation is independent of the underlying OS and can manage threading independently of the OS). I know you think the current Oracle JVM for Windows is a one to one correspondence, but maybe not. And the JNA folks can't assume every JVM will be a one to one thread match.
I’m quite willing to entertain the notion that Java thread != native thread. A couple of points, though:
What is quite common, though, is folks using an explicit call to GetLastError() instead of Native.getLastError(), or inadvertently making additional native calls before retrieving the last error result.
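A minimal simulation of the second pitfall mentioned above: another call that writes the last-error slot runs between the failing call and the read. FakeNative and CaptureOrder are hypothetical stand-ins with no JNA or Win32 involved, and the slot is per-thread here, so this illustrates call ordering only, not cross-thread interference. It relies on one documented Win32 behavior: some functions set the last-error code even when they succeed.

```java
final class FakeNative {
    // Per-thread last-error slot, similar in spirit to Win32's per-thread code.
    private static final ThreadLocal<Integer> lastError = ThreadLocal.withInitial(() -> 0);

    // Simulated failing call: records error 2 (ERROR_FILE_NOT_FOUND).
    static boolean openPort(String name) {
        lastError.set(2);
        return false;
    }

    // Simulated call that succeeds but still writes the slot.
    static void otherNativeCall() {
        lastError.set(0);
    }

    static int getLastError() {
        return lastError.get();
    }
}

final class CaptureOrder {
    // Pitfall: another native call sits between the failure and the read.
    static int readLate() {
        if (!FakeNative.openPort("COM99")) {
            FakeNative.otherNativeCall();     // e.g. a helper that happens to call native code
            return FakeNative.getLastError(); // the original code has been replaced
        }
        return 0;
    }

    // Safer: capture the code immediately after the failing call.
    static int readEarly() {
        if (!FakeNative.openPort("COM99")) {
            int err = FakeNative.getLastError();
            FakeNative.otherNativeCall();
            return err;
        }
        return 0;
    }
}
```

Here `readLate()` returns 0 and `readEarly()` returns 2.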
I’m trying to think what might be useful to expose more information, without introducing undue overhead on the native side. Is there some check that could be made (comparing JNIEnv*, perhaps) to identify that the native TLS is effectively being written by a different Java Thread than a previous write? Certainly calling Thread.currentThread() is doable, but would have a significant execution impact which might end up affecting results.
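The check asked about above could be prototyped at the Java level roughly as follows: tag each stored error with the Java thread that wrote it, and complain when a different thread reads it. This is a hypothetical diagnostic, not JNA code. It deliberately uses one shared slot so the check has something to catch, and it calls Thread.currentThread() on every write and read, which is exactly the overhead raised above, so it would only suit a debug build.

```java
final class TaggedLastError {
    // Immutable record of the error code and the Java thread that wrote it.
    static final class Entry {
        final Thread writer;
        final int code;

        Entry(Thread writer, int code) {
            this.writer = writer;
            this.code = code;
        }
    }

    // Deliberately a single shared slot, so cross-thread writes are detectable.
    private static volatile Entry slot = new Entry(null, 0);

    static void record(int code) {
        slot = new Entry(Thread.currentThread(), code);
    }

    // Returns the stored code, or throws if another Java thread wrote it last.
    static int read() {
        Entry e = slot;
        if (e.writer != null && e.writer != Thread.currentThread()) {
            throw new IllegalStateException("last error written by " + e.writer.getName()
                    + ", read on " + Thread.currentThread().getName());
        }
        return e.code;
    }
}
```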
@twall Tim, that was sent privately to you but you chose to broadcast it. That terminates this conversation.
Bottom line: we have it working with LastErrorException, and it was not working with Native.getLastError() (so get that out of your thought process), because Native.getLastError() was getting overwritten within one Java line (yes, there were non-native log calls, but no calls to JNA). I have given you two hypotheses, and you consistently say "prove it". I don't have to, and I am done with this. I don't care whether you figure this out in the JNA group or not at this point, because your attitude is "we are right unless you can prove otherwise". I have given you examples of where native threads might be reused (and even though "green threads" were left behind in Sun's Linux implementation, that doesn't mean, and it was never put into the JVM spec, that a Java Thread must correspond 1 to 1 to a native thread). If you can come up with a reason why Native.getLastError() was being overwritten when nothing in that thread was making a JNA call between the call causing the error and the Native.getLastError() call, great! I am done offering possibilities or discussing this further with you.
@nyholku @PeteSL
I'm happy to report that the test suite in version 0.0.28 now passes all tests when using jna-4.0.0.jar.
But when using 3.5.1, Test 9 causes Java.exe to crash ("An unhandled win32 exception occurred in java.exe").
Test9 - treshold 100, timeout 100 (not a single "dot" gets printed before the crash)
Here's my version info:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
@st-arnold Did it produce a crash report?
@st-arnold @nyholku Thanks! I can almost guess what caused the crash with 3.5.1, and you can confirm or rule this out if you have time. Part of the last commit for direct mapping support added Native.setPreserveLastError(true); to each JTermiosImpl static initializer and to the WinAPI static initializer in the Windows implementation (right near the top of the source code). We don't need it in WinAPI because that code uses LastErrorException exclusively, so can you comment out (//) that line, recompile, and rerun the test suite? If it runs without error, we have identified a problem with setPreserveLastError in pre-4.0 JNA. If it fails in the same place, then we have identified an issue with pre-4.0 JNA and LastErrorException.
@nyholku No, Windows just offers Just-In Time debugging (which I'm not competent enough to do, I'm afraid).
I tested with another computer (old 32bit WinXP with Java v1.6.0_23), using the same USB-RS232 adapter, and there it works with both 3.5.1 and 4.0.0.
So it seems to be related to Windows version, 32 vs. 64 bit or Java version. I'll try updating Java on that old computer and then test again.
Some more info: it looks like 1.6 is the latest Java for WinXP, so I couldn't update. I tested with another USB-RS232 cable on Win8.1 64-bit with Java 1.8: same problem, so it's probably not a hardware issue (although both are FTDI adapters, just slightly different ones). I also have one PL-2303 adapter, but it doesn't want to work on this computer...
@PeteSL I'll try to recompile the source. As a newbie I haven't done it before, and the "compile" script seems to be made for *nix.
Hmm... Seems to be an issue with Java 1.8? I ran the test suite with JNA 3.5.1 + Java 1.7.0-71 and all tests passed. (I didn't modify the source yet.)
@st-arnold The compile script has the command line for compiling the code (it is a single line), and IIRC it works verbatim on Windows too (maybe you need to replace the slashes with backslashes; I don't remember).
@st-arnold I would recommend getting Eclipse for Windows, then you can single step code easily and see variable values etc etc.
@PeteSL Did you mean commenting out just line 101 in WinAPI.java or something else in addition?
I managed to compile (with 1.7, I had only that JDK installed, should it matter?) with that line commented out and the problem persists when run with java 1.8 and JNA 3.5.1.
@nyholku I have both Eclipse and Intellij IDEA installed. Just tried single stepping purejavacomm with the latter, seems to work. So now I'm better equipped to test and debug.
Now to learn how to make it use lib 3.5.1 ....
@st-arnold Yes, that is exactly what I was asking (comment out line 101 in WinAPI). That is great information. It sounds like there is some interaction between 3.5.1 and the 1.8 JVM (are you using 32- or 64-bit Java on that test?). The reason I had you comment that line out is that in 4.0+, that method is a no-op (it literally does nothing). In 3.5.1, it changes how Native.getLastError() works, so I was checking whether that was the change breaking 3.5.1. Apparently it is not the change that breaks 3.5.1 with JVM 1.8, but at least you have confirmed this is not an issue at all with JNA 4.0 and higher (I test with both 32- and 64-bit Oracle 1.8 JVMs, but only with JNA 4.1). 1.8 introduced a number of new threading mechanisms, so that might be where the failure is occurring. Anyone's guess ;-) Thanks much for your work testing this!
OK, now I can reproduce the problem while debugging in IDEA. (It took a little while to figure out how to compile with JNA 4.0.0 but run with 3.5.1.)
The crash happens at Test9.java line 90: startReadThread();
If I "step over" that statement, it crashes. But if I step into the function and single step over statements until return, I can reach Test9.java line 95: m_Out.write(m_TxBuffer, 0, txn); And then it crashes.
Stepping into write function, the crash happens when entering checkState() function. I step into it and then java crashes.
Commenting out the first startReadThread() at line 84 prevents the crash.
In theory, Test 9 is really "illegal", as it uses 2 threads to read the same port. If the port timed out (as it is supposed to) in the line 85 thread, the serial port might be left in an unknown state when the line 90 thread starts. This is pure conjecture, but there might be something in the way 1.8 handles threading that interferes with 3.5.1's handling of native dispatch and callbacks, whereas in 4.0 they changed that somewhat, so it might be reset properly before the second thread starts. Again, this is pure -conjecture- on my part. I don't see an issue with saying "fully tested on 4.0+; you may see issues with multithreading in JNA 3.5.1 and JVM 1.8 or higher." If they are running a 1.8 JVM, they should be running JNA 4.0 or higher, IMHO.
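Pete's point that Test 9 uses two threads to read the same port suggests one possible change to the test: never start a second reader while the first is still running. The sketch below shows that idea in isolation. ReaderGuard and startExclusiveReader are hypothetical names, the sketch does not reproduce Test9's actual structure, and whether serializing readers avoids the 3.5.1 crash is untested.

```java
final class ReaderGuard {
    private Thread reader; // most recently started reader, if any

    // Starts a new reader only after the previous one has finished, so two
    // threads never read the same port at the same time.
    synchronized void startExclusiveReader(Runnable readLoop) {
        joinQuietly(reader);
        reader = new Thread(readLoop, "port-reader");
        reader.start();
    }

    // Waits for the current reader (if any) to finish.
    synchronized void awaitReader() {
        joinQuietly(reader);
    }

    private static void joinQuietly(Thread t) {
        if (t == null) return;
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reader", e);
        }
    }
}
```

The read loops themselves never take the guard's lock, so joining while holding it cannot deadlock; the trade-off is that the caller blocks until the previous reader exits.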
Okay, I'm not going to dig deeper as it's working fine with JNA 4.0.0. Maybe the test should be modified, though, so it would also pass with 3.5.1? (Or at least change run.bat to use 4.0.0 lib, because that's what newbies like me try first out of the box :)
I guess this issue can be closed now?
Hi,
I'm a total newbie and just tried this library "out of the box" on Win8.1. I noticed that the test suite referenced from run.bat seems to be broken in the latest version (0.0.25).
Probable cause: recent changes to use LastErrorException? TestFreeFormPortIdentifiers.testMissingPortInCommPortIdentifier() seems to expect NoSuchPortException but that is no longer thrown?
The 0.0.23 jar works. Below is the output when using 0.0.25.
PureJavaComm Test Suite
Using port: COM1
TestMissingPort jtermios.windows.JTermiosImpl$Fail
Method)
    at jtermios.windows.WinAPI.CreateFile(WinAPI.java:584)
    at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:140)
    at jtermios.windows.JTermiosImpl.open(JTermiosImpl.java:353)
    at jtermios.JTermios.open(JTermios.java:389)
    at purejavacomm.CommPortIdentifier.getPortIdentifier(CommPortIdentifier.java:101)
    at purejavacomm.testsuite.TestFreeFormPortIdentifiers.testMissingPortInCommPortIdentifier(TestFreeFormPortIdentifiers.java:25)
    at purejavacomm.testsuite.TestSuite.main(TestSuite.java:43)
Caused by: java.lang.RuntimeException: [2]The system cannot find the file specified. getLastError=0
    at jtermios.windows.JTermiosImpl$Port.fail(JTermiosImpl.java:94)
    at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:147)
    ... 5 more
FAILED
Got an identifier for non-exisiting path
Test failure