nyholku / purejavacomm

Pure Java implementation of JavaComm SerialPort
http://www.sparetimelabs.com/purejavacomm/index.html
BSD 3-Clause "New" or "Revised" License
363 stars 146 forks source link

Test suite broken in 0.0.25 #66

Closed st-arnold closed 8 years ago

st-arnold commented 9 years ago

Hi,

I'm totally newbie and just tried this library "out of the box" on Win8.1. Noticed that Test Suite referenced from run.bat seems to be broken in the last version (0.0.25).

Probable cause: recent changes to use LastErrorException? TestFreeFormPortIdentifiers.testMissingPortInCommPortIdentifier() seems to expect NoSuchPortException but that is no longer thrown?

Version 0.0.23 jar works. Below output when using 0.0.25.

PureJavaComm Test Suite Using port: COM1 TestMissingPort jtermios.windows.JTermiosImpl$Fail

    at jtermios.windows.WinAPI$Windows_kernel32_lib_Direct.CreateFile(Native

Method) at jtermios.windows.WinAPI.CreateFile(WinAPI.java:584) at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:140) at jtermios.windows.JTermiosImpl.open(JTermiosImpl.java:353) at jtermios.JTermios.open(JTermios.java:389) at purejavacomm.CommPortIdentifier.getPortIdentifier(CommPortIdentifier. java:101) at purejavacomm.testsuite.TestFreeFormPortIdentifiers.testMissingPortInC ommPortIdentifier(TestFreeFormPortIdentifiers.java:25) at purejavacomm.testsuite.TestSuite.main(TestSuite.java:43) Caused by: java.lang.RuntimeException: [2]Määritettyä tiedostoa ei löydy. getLas tError=0 at jtermios.windows.JTermiosImpl$Port.fail(JTermiosImpl.java:94) at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:147) ... 5 more

FAILED

Got an identifier for non-exisiting path

Test failure

nyholku commented 9 years ago

Ok, thanks for reporting this. I'll have a look in the evening.

nyholku commented 9 years ago

Just released 0.0.26 that should fix this issue. Please test and report back so we can close this (or continue debugging).

st-arnold commented 9 years ago

It still didn't work. It looks to me that there should be a try-catch in CommPortIdentifier.getPortIdentifier ?

PureJavaComm Test Suite Using port: COM1 TestMissingPort
jtermios.windows.JTermiosImpl$Fail at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:144) at jtermios.windows.JTermiosImpl.open(JTermiosImpl.java:353) at jtermios.JTermios.open(JTermios.java:419) at purejavacomm.CommPortIdentifier.getPortIdentifier(CommPortIdentifier. java:101) at purejavacomm.testsuite.TestFreeFormPortIdentifiers.testMissingPortInC ommPortIdentifier(TestFreeFormPortIdentifiers.java:25) at purejavacomm.testsuite.TestSuite.main(TestSuite.java:43) Caused by: java.lang.RuntimeException: GetLastError() returned 0 getLastError=2 at jtermios.windows.JTermiosImpl$Port.fail(JTermiosImpl.java:94) at jtermios.windows.JTermiosImpl$Port.open(JTermiosImpl.java:147) ... 5 more FAILED

Got an identifier for non-exisiting path

Test failure

PeteSL commented 9 years ago

This is due to accommodating the Windows test suite use of WinAPI. I will update JTermiosImpl to not encapsulate the CreateFile function in a try/catch bracket. WinAPI was supposed to be only used by JTermiosImpl but the Windows test suite breaks that base rule.

PeteSL commented 9 years ago

The reason for the change in how we handle Windows functions using the JNA-recommended route of LastErrorException instead of getLastError is because of another issue we ran into where getLastError was being set to 0 for no apparent reason where LastErrorException would not be. Unfortunately, this was done solely focused on JTermiosImpl which and not the Windows test suite which makes direct WinAPI calls. I will change the code for CreateFile in JTermioImpl but if the Windows test app fails on anything else, I am going to recommend it be either removed or updated to use LastErrorException since this is the only test app that access WinAPI directly and the focus should be on the PureJavaComm interface.

nyholku commented 9 years ago

I've reverted my local windows.JTermiosImpl to the original (before Direct Mapping) state and then retrofitted it to use the DM WinAPI and the new FDSet plus incorporated the original fix for the WaitCommEvent()==false GetLastError()==0 so that the LastErrorException is not used anywhere and now it passes all tests.

I did this because this morning when I finally fixed OP's error I run into next one and then another one and eventually it worked but I still got stack trace on the (err stream) console when some LastErrorExceptions were thrown (for cases that were really not errors but normal non zero return values) and I could not find how to disable those.

Also all these problems we have had with the introduction of LastErrorException convinced me that it was an error to introduce it to a code base that was coded with GetLastError in mind. Number of bugs for marginal gain and who know what corner case lurk in there.

Now I have a code that is very close the original design (implying high probability that it works for all cases that the original code worked for) and which that the DirectMapping and improved FDSet and which passes all tests.

PeteSL commented 9 years ago

I wish you hadn't done that because you will introduce more problems than help. The primary issue is the Windows test suite code which shouldn't be there anyway as that is not the focus of PureJavaComm. I made and pushed the change to JTermiosImpl already and that works without issue. We don't know where else problems that were seen with WaitCommEvent might also occur where you were circumventing JNA's use of GetLastError. I guarantee (after looking at the JNA native code) that if you try calling GetLastError directly, it will be replaced by their Native.getLastError() (per JNA's response) and we don't know at this point what is in the code that could be resetting Native.getLastError() to zero at WaitCommEvent. The code that I did for LastErrorException is working for all JTermios/PureJavaComm tests and that is where our focus should be. I have generated a pull request which brings JTermiosImpl CreateFile back to original but let's leave the rest alone as it is working correctly. If there are other issues with the Windows test suite, it should be looked at changing from now on instead of the base code.

nyholku commented 9 years ago

I'm surprised that you did not run into the test suite issues that I ran into. I implemented the same fix you just pushed and still had problems with Test9 and Test11.

As to circumventing GetLastError, that was not intentional and harmless and I've retained that change in my code.

The only real issue we have seen is with WaitCommEvent and in retrospect I think introduction of LastErrorException was too risky and sweeping way to handle an anomaly that we were able to handle with just a single if-clause. I find it hard to believe that Native.getLastError is broken (or if it is, then that should be fixed), surely this project cannot be the only/first one to run into this problem.

Something mysterious going on there for sure but no reason to expect that anything else is broken, I'm much more concerned about the broken control flow and the general ugliness of retrofitting the LastErrorException. I should have rejected that in the first place.

I'm not worried about the Windows test suite.

At the moment I'm not prepared to allow the LastErrorException into my code base, but I will take a time out and return to this in the evening.

(BTW I think we forget to cross check both your and the OPs JNA versions, mine is 4.0.0)

PeteSL commented 9 years ago

I did not run the test suite as it was unnecessary for my installation. I use PureJavaComm because it emulates the original javax.comm interface and that is where my focus is. As I recognized the Windows test suite was making direct calls to WinAPI, I felt it was not necessary to use since the stated purpose of PureJavaComm's JTermios implementations was to implement, without error, those functions, not necessarily directly test various system implementations directly. I would guess the Windows test suite will fail and continue to fail with the LastErrorException adds but this is the only way we know for sure that the proper error will be passed out of the JNA code. I am using the current JNA release 4.1.0 because that is what users will download. If the 4.0.0 getLastError() handling was different (like supporting the persistLastError()), then that would definitely show why some of the issues were seen and makes using LastErrorException even more important since that always has persisted the error code in the exception itself. JNA has been pushing the LastErrorException direction for quite some time now, I would guess, because of the various implementation limitations and variances and whether the last error was persisted or not. I am guessing they are trying to remove ambiguity as well as bringing it more in line with Java coding "standards". Personally, I am not partial either way other than wanting to be sure PureJavaComm works today and going into the future with JNA.

st-arnold commented 9 years ago

I had the same result with JNA 3.5.1 and 4.0.0 jars (the two that come shipped under lib dir). And if it's of any help, I'm using Java version 1.8.0_60-b27 (64-bit).

nyholku commented 9 years ago

Just to make sure we are on the same page.

I did NOT talk about and did not run and don't care about jtermios.windows.testsuite.TestSuite.

I DID talked about and did run the purejavacomm.testsuite.TestSuite.

I'm ready scrap jtermios.windows.testsuite.TestSuite and make the WinAPI visibility 'package' (default) in effect making it private.

But I'm not ready to use LastErrorException yet.

I will consult the JNA list for about the getLastError / LastErrorException usage.

PeteSL commented 9 years ago

But the purejavacomm.testsuite calls the Windows testsuite so that is where the disconnect is. If you revert back, simply change back the try/catch brackets to be tests (since those tests are already there). Add the following line to the beginning of -every- jTermiosImpl static initializer (good that they got moved to a static initializer so we can control sequence):

Native.setPreserveLastError(true);

This ensures Native.getLastError() in -all- platforms and -all- versions of JNA does actually preserve the last error for direct mapping. In the older JNA versions, this is the description of setPreserveLastError (from 3.5.1 JavaDoc, 4.0 says it is always true as does 4.1):

Deprecated. The preferred method of obtaining the last error result is to declare your mapped method to throw LastErrorException instead. Set whether the system last error result is captured after every native invocation. Defaults to true (false for direct-mapped calls).

PeteSL commented 9 years ago

Leave fail() the way it is and and call it with new LastErrorException(Native.getLastError()). That way you get a full stack list instead of just a line number.

nyholku commented 9 years ago

Where did you get:

Set whether the system last error result is captured after every native invocation. Defaults to true (false >for direct-mapped calls).

What I found from here:

http://twall.github.io/jna/4.1.0/

is

Deprecated. Last error is always preserved This method is obsolete. The last error value is always preserved. NOTE: The preferred method of obtaining the last error result is to declare your mapped method to > throw LastErrorException.

Your quote indicates that indeed the original problem was introduced by the Direct Mapping because the setPreserverLastError defaults to false for direct mapping but what I read is that it is always preserved and thus this obsolete method is not needed and it should be safe to use getLastError.

PeteSL commented 9 years ago

As stated "(from 3.5.1 JavaDoc, 4.0 says it is always true as does 4.1)". The problem with that, however, is that the problem was reported with JNA 4.0 which says that it is always true just like 4.1. I will say that I saw issues with 4.1 when I first started testing (as I described in that issue) yet the JNA native code doesn't even look at at that setting anymore nor does the Java code so that is why we went the route of LastErrorException: that setting is ignored in JNA 4+ yet we were still seeing the issue in 4 and 4.1. However, it doesn't hurt to have it there and might help with people using older versions of JNA.

nyholku commented 9 years ago

But the purejavacomm.testsuite calls the Windows testsuite

I don't see that.

nyholku commented 9 years ago

However, it doesn't hurt to have it there and might help with people using older versions of JNA.

Yes, sure, I will include that.

PeteSL commented 9 years ago

You are correct but a FailException should have been thrown anyway in the original code (for some reason, I thought we were looking at the Windows test suite) because fail() is being called in the original code. So I guess I do not understand why it was failing. If there is an issue with the FailException not being thrown correctly with the new code, then maybe we should have looked there first instead of throwing out the LastErrorException code.

PeteSL commented 9 years ago

@nyholku BTW, one of the things I am suspicious of is JNA's attempt to use a ThreadLocal variable in native code. I will look at their Java code to see how they are actually allocating and accessing it.

PeteSL commented 9 years ago

@nyholku Maybe the issue isn't with the LastErrorException but with my implementation of the fail() method? I would take a look at that before you start pulling all of the LastErrorException code. As I say, we saw the reset-to-zero issue with 4.0 and 4.1 and the only sure way of getting the true last error code is using the exception. Apparently they have a thread-local issue even though they say they don't.

PeteSL commented 9 years ago

@nyholku And if this is happening with Windows, could it be happening with the 'ix implementations too? Hmmm, I don't see why it couldn't happen since they all use the same call/check error code. While this issue is in JNA, we have to deal with it properly to make sure we cover old and new JNA implementations...

nyholku commented 9 years ago

@PeteSL sorry about the radio silence,just got home and I need a time out, I'll be back tomorrow.

PeteSL commented 9 years ago

@nyholku I fully understand. Get some rest. I think I understand why the problem with the test suite and it stems to the exceptions that the Fail exception gets translated to. I guess we were doing things right with the LastErrorException but the code that translates the Fail exception to the PureJavaComm exception (like NoSuchPortException) wasn't accommodated. I would much rather see us migrate to using the LastErrorException throughout since we have a known JNA flaw, transient as it might be, and modify the Fail exception "translation" code to generate the correct exception based on the new Fail exception format. Just food for thought. Get some sleep.

PeteSL commented 9 years ago

@nyholku To be honest, I don't see how this ever succeeded. I am looking at the original code (before I ever touched it) and the Windows JTermiosImpl.open() always threw a Fail exception which should have failed the test:

                        m_Comm = CreateFileW(new WString(filename), GENERIC_READ | GENERIC_WRITE, 0, null, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, null);

                        if (INVALID_HANDLE_VALUE == m_Comm) {
                            if (GetLastError() == ERROR_FILE_NOT_FOUND)
                                m_ErrNo = ENOENT;
                            else
                                m_ErrNo = EBUSY;
                            fail();
                        }

Please tell me how this ever passed the test. I am not being argumentative. I am trying to understand.

PeteSL commented 9 years ago

All of this and I found the issue was in my initial wrap of the CreateFile function in the port.open() method. If you apply my pull request, there will no longer be any issue. This really had nothing to do with catching the LastErrorException and I believe we should continue to use it throughout the code, not just in the Windows implementation because of the issue we found with how JNA might be resetting the getLastError value (we don't know where it might surface elsewhere). This is why the original wrap was wrong (from the commit #58 history listing):


-                       m_Comm = CreateFile(filename, GENERIC_READ | GENERIC_WRITE, 0, null, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, null);
+                                                try {
+                                                    m_Comm = CreateFile(filename, GENERIC_READ | GENERIC_WRITE, 0, null, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, null);

-                       if (INVALID_HANDLE_VALUE == m_Comm) {
-                           if (GetLastError() == ERROR_FILE_NOT_FOUND)
-                               m_ErrNo = ENOENT;
-                           else
-                               m_ErrNo = EBUSY;
-                           fail();
-                       }
+                                                    if (INVALID_HANDLE_VALUE == m_Comm) {
+                                                        m_ErrNo = EBUSY;
+                           throw new LastErrorException(0);
+                                                    }
+                                                } catch (LastErrorException lee) {
+                                                    fail(lee);
+                                                }

Should have been:

+                                                } catch (LastErrorException lee) {
+                            if (lee.getErrorCode() == ERROR_FILE_NOT_FOUND)
+                               m_ErrNo = ENOENT;
+                           else
+                                m_ErrNo = EBUSY;
+                                                    fail(lee);
+                                                }

All of this because I didn't translate that portion of the code correctly. Sorry for the confusion. The last pull request will work with the CreateFile that doesn't throw a LastErrorException. If you want to re-implement the LastErrorException, I can make this change as well which will ensure we get the actual GetLastError() code.

nyholku commented 9 years ago

Good morning Pete,

the code snippet is really exactly what makes it work. I don't see how I can put that more clearly. If the comm port fails to open we fail. m_ErrNo is set to reflect the reason for the failure as reported by the GetLastError. Either the port is not there or it is in use (we don't handle access rights separately as JavaComm does not handle that either. The void open(String, int) who calls this function captures the failure and returns -1 and the 'client' code then checks the JTermios.errno() in the usual *nix manner.

Because your mod did not set the errno correct the original test failed as the PureJavaSerial.open got the impressions from the (EBUSY) that there port exists but is busy.

(BTW thisvoid open(String, int) should actually be private as it only exists to make the int open(String,int) more clear or they could be merged.

nyholku commented 9 years ago

@PeteSL I see my comments were behind times. No matter.

I've been mulling over this for a good deal and I'm not going to allow LastErrorException in my code base.

I've not pulled your latest and thus don't know how well it behaves but what drove me over the edge was yesterday when I had fixed above issues I run into this:

Test11 - exit from blocking read/write .com.sun.jna.LastErrorException: [996] Overlapped I/O event is not in a signaled state. at jtermios.windows.WinAPI$Windows_kernel32_lib_Direct.GetOverlappedResult(Native Method) at jtermios.windows.WinAPI.GetOverlappedResult(WinAPI.java:779) at jtermios.windows.JTermiosImpl.write(JTermiosImpl.java:520) at jtermios.JTermios.write(JTermios.java:433) at purejavacomm.PureJavaSerialPort$1.write(PureJavaSerialPort.java:636) at purejavacomm.testsuite.Test11$2.run(Test11.java:94) at java.lang.Thread.run(Unknown Source)

At that point in time I did not want to dig any deeper and took this as evidence of more logic broken because of the LastErrorException breaking the control flow.

In summary I wont use LastErrorException because:

I'm sorry you have put so much time into this LastErrorException, I should have refused this in the first place.

You've done sterling work everywhere but I just can't see my way to allowing LastErrorException here.

I guess we need to agree to disagree on this one.

PeteSL commented 9 years ago

@nyholku I understand your frustration (I was not complete in my implementation of the try/catches) but also understand that the getLastError could be reset to zero in any instance. This was not unique to the WaitCommEvent call since it is working correctly. Something in JNA where this error code is supposedly kept in a thread-local variable is resetting it to zero and the LastErrorException is the only sure way of getting the real code generated. Interesting on the error because that is something I made sure was handled correctly.

                                try {
                        if (!WriteFile(port.m_Comm, port.m_WrBuffer, length, port.m_WrN, port.m_WrOVL))
                                            // Should never get here so tell us
                                            throw new LastErrorException(0);
                                } catch (LastErrorException lee) {
                                    if (lee.getErrorCode() == 0)
                                        System.err.println("ReadFile returned false but GetLastError==0");
                                    else if (lee.getErrorCode() != ERROR_IO_PENDING)
                                        port.fail(lee);
                                    port.m_WritePending = length;
                                }

so the port.m_WritePending only gets set if there is a write pending and is reset to zero if the overlapped operation completes (I did not change that portion of the code at all). (I know, copy/paste got me with the note about "ReadFile returned false but GetLastError==0" but that is just cosmetic.)

As I say, I am not tied to LastErrorException and if you want to revert back, please do so with the understanding that we have identified an issue in JNA where the Native.getLastError() function may not return the actual value that was returned immediately after the Windows function call and that the static initializers need to have the Native.setPreserverLastError(true); line in all JTermiosImpl to ensure compatibility with pre-4.0 JNA code. Also, since we identified the JNA issue with WaitCommEvent, be sure to consider GetLastError() to be ok if it is zero at that point. You may want to return to this in the future if direct mapping, which is significantly faster on all platforms, continues to present last error issues in your code. I will continue to support your excellent efforts with this product.

PeteSL commented 9 years ago

@nyholku I also note that you do not check which event triggered the WaitMultipleObjects in that part of the code. Could it be that the port got closed before the operation completed? That would cause the WaitMulipleObjects to fire but not have an Overlapped result. Just thinking out loud again... Keep in mind that with direct mapping, things happen a lot faster and race conditions are going to be more visible. In point of fact, that is exactly what is happening. You are closing the port while a write is pending. It is responding exactly as expected.

PeteSL commented 9 years ago

@nyholku Your test didn't fail! It simply reported that the write operation did not complete! I add printStackTrace code to read, write, and open so we could see the actual error being generated! Again, the test didn't fail as it returned -1 without error but at least we now know what caused it!

PeteSL commented 9 years ago

@nyholku I understand your frustration but you made an invalid assumption that the write code was broken; it is not. If you don't want to see the stack traces on errors, remove them but don't assume the lasterrorexception handling is broken if the tests are not failing, just reporting what the underlying error was. Do as you will but I still believe the only way you will ensure proper operation of the direct mapping code is with LastErrorException because of the known JNA implementation issue we have identified.

nyholku commented 9 years ago

Do as you will but I still believe the only way you will ensure proper operation of the direct mapping code is with LastErrorException because of the known JNA implementation issue we have identified.

This is what the JNA author T Wall had to say about this:

There have been a few reports of uncertainty around GetLastError, but no one's produced anything concrete and not attributable to w32 API inconsistency.

PeteSL commented 9 years ago

Yes, and we were able to produce a concrete example on 1 computer in one environment and duplicate it in another environment but no longer reproducible on that environment. The code I put in for the LastErrorException is also reporting anytime an error occurred and getLastError is reset to zero or, in the case of WaitCommEvent, where an error was expected but did not occur. That way we could definitively track this phenomena and see if it was JNA resetting the Java variable or GetLastError() actually being reset in the native code. In the long run, we could be doing the JNA folks a favor by leaving the LastErrorException code in there as it will tell us if GetLastError() is returning 0 when it shouldn't be (but we are considering it bad anymore so PureJavaComm won't break).

PeteSL commented 9 years ago

@nyholku I see you reverted away from every looking for LastErrorException and I believe that was extreme the other direction for my 2 coding errors. This is not something directly tied to the WaitCommEvent but tied to why GetLastError() is being set to zero (success) after the WaitCommEvent returns. If this is tied to concurrent operations using direct mapping, then we might see it again with ReadFile, WriteFile, and GetOverlappedIO (would be my guess). WaitMultipleObjects is still in the regular JNA mapping class so that will likely not have the issue (we saw this with the implementation of direct mapping which may just be a coincidence but...). I would recommend we revisit the throw LastErrorException for those functions (and WaitCommEvent) and simply move the return false error handling to the catch part of the try/catch where those functions are called. This will minimize the likelihood of coding errors (which I acknowledge that my coding errors were the sole source of problems with the prior implementation) and ensure we get the true code returned by GetLastError().

I didn't see the add of Native.setPreserveLastError(true); to the static initializers (but I admit I didn't look too closely yet). I would do that just to ensure pre-4.0 JNA users will properly work with the Direct Mapping implementations.

nyholku commented 9 years ago

On 12 Sep 2015, at 19:07, PeteSL notifications@github.com wrote:

@nyholku I see you reverted away from every looking for LastErrorException and I believe that was extreme the other direction for my 2 coding errors. This is not something directly tied to the WaitCommEvent but tied to why GetLastError() is being set to zero (success) after the WaitCommEvent returns. If this is tied to concurrent operations using direct mapping, then we might see it again with ReadFile, WriteFile, and GetOverlappedIO (would be my guess). WaitMultipleObjects is still in the regular JNA mapping class so that will likely not have the issue (we saw this with the implementation of direct mapping which may just be a coincidence but...). I would recommend we revisit the throw LastErrorException for those functions (and WaitCommEvent) and simply move the return false error handling to the catch part of the try/catch where those functions are called. This will minimize the likelihood of coding errors (which I acknowledge that my coding errors were the sole source of problems with the prior implementation) and ensure we get the true code returned by GetLastError().

I prefer to wait and see if we get problems, if we do I'll eat the humble pie and admit I made a wrong decision.

I didn't see the add of Native.setPreserveLastError(true); to the static initializers (but I admit I didn't look too closely yet). I would do that just to ensure pre-4.0 JNA users will properly work with the Direct Mapping implementations.

My bad, I had done all this before that subject came up and forget as I've rather busy past two days. I will add that to next release (hopefully within days).

PeteSL commented 9 years ago

@nyholku I understand. BTW, I think I have identified why Native.getLastError() is being reset to zero: it is NOT thread local storage, it is in -process- local storage. The JNA code was originally written for 'ix platforms which supports something called pthread functions. However, when you load a DLL in Windows, it is loaded by the process:

BOOL WINAPI DllMain(HINSTANCE hDLL, DWORD fdwReason, LPVOID lpvReserved) {
  switch (fdwReason) {
  case DLL_PROCESS_ATTACH:
    tls_thread_data_key = TlsAlloc();
    if (tls_thread_data_key == TLS_OUT_OF_INDEXES) {
      return FALSE;
    }
    break;
  case DLL_PROCESS_DETACH:
    TlsFree(tls_thread_data_key);
    break;
  case DLL_THREAD_ATTACH:
    break;
  case DLL_THREAD_DETACH: {
    thread_storage* tls = (thread_storage*)TlsGetValue(tls_thread_data_key);
    if (tls) {
      dispose_thread_data(tls);
    }
    break;
  }
  default:
    break;
  }
  return TRUE;
}

Note they are creating "thread local storage, TLS" for the -process- attach, not the thread attach because they are using a static variable in callback.c for tls_thread_data_key, not a thread-local variable. Bottom line is if they wanted -Java- thread-local variables, they should have accessed ThreadLocal variables in Java, not -assume- that every JVM creates a separate native thread for each thread (they don't) and then they should have realized they aren't creating a thread-local variable for Windows, it is a process-local variable.

That explains why it was rare and why throw LastErrorException always works (that code is on the stack, always thread-local) and Native.getLastError() can fail on Windows (I don't know enough about pthreads and 'ix dynamic library calls to guess whether it might be an issue with them). The reason we hadn't seen it before (or didn't realize it) is because Direct Mapping is so much faster that the likelihood of getLastError() being set to something new in a different thread is significantly higher than with JNA mapping.

I know you are tired of the LastErrorException issue but it was bothering the heck out of me as to why LastErrorException would never fail (be set to something else) but Native.getLastError() could fail. We now know the answer and why the only reliable catch of Windows GetLastError() is with the LastErrorException.

PeteSL commented 9 years ago

@nyholku I will be happy to revisit using throw LastErrorException and -this- time simply use try/catches to wrap each function in WinAPI. Also, to make it as transparent as possible, I will use a ThreadLocal variable for last error in WinAPI so there won't need to be any code changes outside of WinAPI! Does that sound like a reasonable compromise?

nyholku commented 9 years ago

On 12 Sep 2015, at 20:58, PeteSL notifications@github.com wrote:

@nyholku I understand. BTW, I think I have identified why Native.getLastError() is being reset to zero: it is NOT thread local storage, it is in -process- local storage. The JNA code was originally written for 'ix platforms which supports something called pthread functions. However, when you load a DLL in Windows, it is loaded by the process:

BOOL WINAPI DllMain(HINSTANCE hDLL, DWORD fdwReason, LPVOID lpvReserved) { switch (fdwReason) { case DLL_PROCESS_ATTACH: tls_thread_data_key = TlsAlloc(); if (tls_thread_data_key == TLS_OUT_OF_INDEXES) { return FALSE; } break; case DLL_PROCESS_DETACH: TlsFree(tls_thread_data_key); break; case DLL_THREAD_ATTACH: break; case DLL_THREAD_DETACH: { thread_storage* tls = (thread_storage*)TlsGetValue(tls_thread_data_key); if (tls) { dispose_thread_data(tls); } break; } default: break; } return TRUE; }

Note they are creating "thread local storage, TLS" for the -process- attach, not the thread attach because they are using a static variable in callback.c for tls_thread_data_key, not a thread-local variable. Bottom line is if they wanted -Java- thread-local variables, they should have accessed ThreadLocal variables in Java, not -assume- that every JVM creates a separate native thread for each thread (they don't) and then they should have realized they aren't creating a thread-local variable for Windows, it is a process-local variable.

I think you should rehash this in the JNA list both to check the validity of your conclusion and alert them of this issue so they can fix it.

PeteSL commented 9 years ago

@nyholku I will but I also think the modification to WinAPI would be effective and would be guaranteed to work now with PureJavaComm without rewriting anything other than the WinAPI methods and would work with any JNA version, even if JNA fixes this in the future.

nyholku commented 9 years ago

On 12 Sep 2015, at 21:03, PeteSL notifications@github.com wrote:

@nyholku I will be happy to revisit using throw LastErrorException and -this- time simply use try/catches to wrap each function in WinAPI. Also, to make it as transparent as possible, I will use a ThreadLocal variable for last error in WinAPI so there won't need to be any code changes outside of WinAPI! Does that sound like a reasonable compromise?

— Reply to this email directly or view it on GitHub.

Can you give me an example how you would go about it?

In the static method you wrap the call to the native function in a try/catch LastErrorException, capture the last error from the exception and keep it in a thread local variable; and the modify the GetLastError to get the error code from there?

PeteSL commented 9 years ago

@nyholku Exactly. And if the exception is thrown, the static method will return false, just like it is supposed to.

PeteSL commented 9 years ago

@nyholku That is a quick and painless fix that will work today and tomorrow with any JNA version.

nyholku commented 9 years ago

@PeteSL I can live with that! This approach seems very safe, so go ahead, I'll merge it.

But also consider letting the JNA people know about the bug ... or you mind if I do?

PeteSL commented 9 years ago

@nyholku Since you already have this issue open with them, feel free to take it to them. I think it could be an issue with 'ix platforms as well if the JVM doesn't do native threads for each Java thread but that is just conjecture on my part. It is definitely an issue with Windows as we have seen. I will rebase to 0.0.27, make the mods and also add the Native.setPreserverLocalError(true); lines to the static initializers (harmless as well and will help people using pre-4.0 JNA).

nyholku commented 9 years ago

@st-arnold Please test version 0.0.28 which has PeteSL's latest changes merged and report back.

twall commented 9 years ago

hi all, I hope you'll excuse my interjection, I'd like to tackle this closer to its source.

On w32, JNA establishes a thread-local index on process attach. The actual thread-local storage is lazily created by any subsequent thread actually asking for access to it.

The JVM generally does do a native thread per JVM thread, and creates JVM Thread objects when a purely native thread does an "attach" operation. Even if the JVM were managing its own multi-thread execution within a single native thread, that type of multi-tasking would not be capable of pre-empting in the middle of native code outside its control.

It'd be great if I could reproduce this in (relative) isolation, if for no other reason than to be able to verify if/when the issue is fixed.

twall commented 9 years ago

Please note that any direct-mapped calls to GetLastError() should be replaced by Native.getLastError(). Where JNA could intercept such calls with interface mapping (and thus get you the result you really want), it does not do so with direct mapping. Calling the API method directly will not necessarily provide you with the error set on the last API call, while Native.getLastError() will.

PeteSL commented 9 years ago

I appreciate your endeavor to analyze the JNA win32 code but contrary to your analysis, it does not create TLS storage on a thread basis; it is allocated on a process basis (unfortunately). We do not use a native mapped GetLastError() nor have we used a direct call to GetLastError() since initial implementation of the direct mapping. We were able to prove the Native.getLastError() value was being overwritten in certain Windows environments at the WaitCommError call in the select() code. Key word here is "certain" because this is a timing issue where another thread overwrites the Native.getLastError() value before it is checked in this thread (this issue was proven and is semi-acknowledged by the JNA group, "rumors but no concrete examples"). We have also proven that using LastErrorException where the last error code is saved on the stack and then put directly into the exception (undeniably thread-local) ensures the code is not altered.

This puts us back to release 0.0.28 uses LastErrorException exclusively for the WinAPI code, uses a definitively Java ThreadLocal value to hold the last error code for calls to WinAPI.GetLastError() (not Windows GetLastError() or Native.getLastError()) which works in all cases with all recent versions of JNA (at least from 3.5.1 on).

twall commented 9 years ago

The TLS index is created on a per-process basis (via TlsAlloc). This establishes an identifier that may be used across all threads to modify the contents of a thread-specific "slot"[1].

The actual reference to storage (maintained via TlsGetValue() and TlsSetValue() using the per-process TLS index) is definitely created and destroyed on a per (native) thread basis.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ms686991(v=vs.85).aspx

PeteSL commented 9 years ago

Tim, I am going to ask you to pursue this on the JNA Github group. As I stated, we have proven that Native.getLastError() could be overwritten by another thread (or something else in JNA) and that 0.0.28 uses the LastErrorException and stores the thrown error code (passed on the stack to the point the exception is thrown so it -can't- be overwritten) in a Java ThreadLocal variable. This is known to never overwrite the variable from another thread and PureJavaComm works correctly.

I have been writing Windows serial communications software for over 2 decades and am very aware of how all of the Windows APIs work. I do not need references to MSDN library pages (I am a MSDN author for a many decades). I understand your desire to defend the JNA code but we have proven the case where the "native thread local" storage is apparently being overwritten and we also have a proven method to avoid such issues in all JVMs and native implementations.

By the way, you said that if the JVM is internalizing the threads, it can't interrupt the process. True, it won't interrupt the native process but there is certainly nothing that prevents it from interrupting the Java method between lines. Again, not trying to pursue this any further here since we have a working solution. Any further pursuit of this should be done with the JNA group.