python-microscope / microscope

Python library for control of microscope devices, supporting hardware triggers and distribution of devices over the network for performance and flexibility.
https://www.python-microscope.org
GNU General Public License v3.0
66 stars 38 forks source link

LinkamCMS segfaults in Linux #273

Open carandraug opened 1 year ago

carandraug commented 1 year ago

As reported in https://forum.image.sc/t/driving-a-linkam-cms196-with-python-microscope/78773/2 LinkamCMS fails to initialize in Linux:

Using microscope we do not succeed in completing init_sdk() – it manages to load the DLL (although we hard to write the path in hard)

__class__._lib = ctypes.CDLL("libLinkamSDK.so")

We’re then able to read the license correctly, however when we get to:

   cfunc = ctypes.CFUNCTYPE(_uint32_t, _CommsHandle, _ControllerStatus)(
       __class__._on_new_value
   )

We get this output:

*** stack smashing detected ***: terminated
Aborted

I've tried in Linux, even without hardware, and can reproduce the issue so the problem is not on the SDK.

carandraug commented 1 year ago

Here's a minimal code example showing the issue:

$ cat reproduce-273.py 
import ctypes

class _ControllerStatusFlags(ctypes.Structure):
    """ControllerStatus.flags struct from C headers"""

    _fields_ = [
        ("controllerError", ctypes.c_uint, 1),
        ("heater1RampSetPoint", ctypes.c_uint, 1),
        ("heater1Started", ctypes.c_uint, 1),
        ("heater2RampSetPoint", ctypes.c_uint, 1),
        ("heater2Started", ctypes.c_uint, 1),
        ("vacuumRampSetPoint", ctypes.c_uint, 1),
        ("vacuumCtrlStarted", ctypes.c_uint, 1),
        ("vacuumValveClosed", ctypes.c_uint, 1),
        ("vacuumValveOpen", ctypes.c_uint, 1),
        ("humidityRampSetPoint", ctypes.c_uint, 1),
        ("humidityCtrlStarted", ctypes.c_uint, 1),
        ("lnpCoolingPumpOn", ctypes.c_uint, 1),
        ("lnpCoolingPumpAuto", ctypes.c_uint, 1),
    ]

cfunc = ctypes.CFUNCTYPE(ctypes.c_uint64, _ControllerStatusFlags)
cfunc(lambda x: 1)
$ python3 reproduce-273.py 
*** stack smashing detected ***: <unknown> terminated
Aborted

Looking at ctypes docs I see this warning:

Warning ctypes does not support passing unions or structures with bit-fields to functions by value. While this may work on 32-bit x86, it’s not guaranteed by the library to work in the general case. Unions and structures with bit-fields should always be passed to functions by pointer.

And indeed that's we are doing in:

        cfunc = ctypes.CFUNCTYPE(_uint32_t, _CommsHandle, _ControllerStatus)(
            __class__._on_new_value
        )

_ControllerStatus is a union and one of its fields is the _ControllerStatusFlags bit-field struct. This only crashes if the _ControllerStatusFlags struct has more than 12 fields.

carandraug commented 1 year ago

This issue is happening when setting the callback for status events:

typedef void (*EventNewValueCallback)(CommsHandle hDevice, LinkamSDK::ControllerStatus status);

We can't pass unions and structures and bit-fields by value. But that's the interface for the callback so we can't just change it pointer.

But I think we can drop the bit-fields ourselves. ControllerStatus is a union where one member is a struct with access to each bit and a 64bit int that holds all the flags:

    union ControllerStatus
    {
        struct
        {
            unsigned    controllerError                 : 1;
            // ... 63 other bit-fields
        }               flags;                                  ///< Accessor to the flags.
        uint64_t        value;                                  ///< Flags as a single value;
    };

If the problem is the bit-field, we should be able to remove the flags struct from the ControllerStatus union and simply pass a uint64_t and manually access the bits we care about ourselves.

Unfortunately, the Linkam API does this a lot so we probably need to do the same in a bunch of other places

I do not have access the hardware to do this though.

carandraug commented 1 year ago

Test fix in my 273-controller-status-value. Needs testing in real hardware and probably the same hack needs to be applied to all other unions in LinkamCMS. A nicer fix is likely to be possible but without hardware access I won't be making any deep changes.

iandobbie commented 1 year ago

Any real hardware tests will need either Tom or Jingyu to contribute. However I'm not sure either of them runs Linux anywhere. Do you think this is limited to Linux or will it also be an issue on Windows?

edwardando commented 1 year ago

I will try it as soon as I get access to the stage again and will keep you posted

edwardando commented 1 year ago

Hi there, this is a little complex but:

  1. On line 430... _ControllerStatusValue = _uint64_t should probably be _ControllerStatusValue = _uint64_t

  2. We have an error when making this correction, but it's due to recent developments and not the commit in itself (i.e., we don't have this error with the 0.6 version from pip, but we do have it with the latest code from master):

    TypeError: __init__() missing 1 required positional argument: 'index'
    Exception ignored in: <function _LinkamBase.__del__ at 0x7f098ebfef70>
    Traceback (most recent call last):
    File "/home/lnd/LaserMeltingMicroscopeControl/python-microscope-carandraug-venv/lib/python3.8/site-packages/microscope/stages/linkam.py", line 1107, in __del__
    self._process_msg(Msg.CloseComms)
    File "/home/lnd/LaserMeltingMicroscopeControl/python-microscope-carandraug-venv/lib/python3.8/site-packages/microscope/stages/linkam.py", line 1119, in _process_msg
    if not self._lib.linkamProcessMessage(
    AttributeError: 'NoneType' object has no attribute 'linkamProcessMessage'
edwardando commented 1 year ago

Starting again from the 0.6.0 tag, and applying your changes we definitely get further, but we're getting a segfault in open_comms():

    self._process_msg(
        Msg.OpenComms,
        byref(self._commsinfo),
        byref(self._h),
        result=self._connectionstatus,
    )

...with both versions of the .so file. Any thoughts? Thanks for your help!

carandraug commented 1 year ago

Before fixing that, I'd prefer we fix the reason why you can't use the current development version (otherwise I will need to fix the segfault in 0.6.0 and then "forwardport" it to the development). From the error message, I think the error comes because you don't specify the index argument when you construct the stage.

Can you try to construct it with LinkamCMS(uid="", index=0) and see if that moves it forward?

edwardando commented 1 year ago

Yes, will do as soon as I can access the linkam stage again!

edwardando commented 1 year ago

OK the index=0 seems to improve things, we are now able to work from your branch with the fix to line 430 mentioned above.

We're hacking the paths to the .so file and the .lsk file but this is fine for testing.

Now we're able to get further, and run your script, but the status we get back from the stage seems to say "connected=False", what could that be a symptom of?

edwardando commented 1 year ago

guys?

iandobbie commented 1 year ago

Can we help?

mickp commented 1 year ago
carandraug commented 1 year ago

Adding to what @mickp said: we're redirect the logs to devnull, can you change it to some filepath (writable) and check if there's some hints there?

iandobbie commented 1 year ago

Are we sure this isn't a license issue? The SDK does require a license file to work.

edwardando commented 1 year ago

It's not a license issue, this we've checked carefully. OK we'll look into the firmware version...

iandobbie commented 1 year ago

Is there any way to borrow a windows machine and test on that. The system is usb so a loan of a laptop might help. I suggest this as the current code works well on the Diamond system and was definitely working on the Oxford one when I left 18 months ago. These both use windows and that dll. I know this may not suit your purposes but should so that the firmware etc are genarlaly working.

edwardando commented 1 year ago

I have borrowed a windows machine and confirmed that I can drive the stage. The state of affairs is as follows:

stack smashing detected ***: terminated Aborted

Is this normal?

carandraug commented 1 year ago

The UID being required is normal. in the case of Ubuntu we don't know whether that is normal since we never tested in Linux:

https://github.com/python-microscope/microscope/blob/a921847ae80d51d451a2e73af3b68070479c42b2/microscope/stages/linkam.py#L979

However, it is certainly not the desired outcome. I do not have access to a Linkam stage to fix that. Now do you know it should work in Windows, if you could fix it to work in Linux too that would be great.

edwardando commented 1 year ago

sure, but debugging a segfault like this is quite hard, I don't know where to start...

carandraug commented 1 year ago

I recommend creating a minimal example using ctypes only and without Python-Microscope. The script shouldn't be too long since there's a lot of stuff you won't need (such as setting all the callbacks and adding device settings) and you can mostly copy and paste from Python-Microscope. I would work from there to identify what it is that we're doing wrong.