micro-manager / pymmcore

Python bindings for MMCore, Micro-Manager's device control layer
https://pypi.org/project/pymmcore/
GNU Lesser General Public License v2.1

Hard crash with `waitForConfig()` timeout #90

Open hinderling opened 8 months ago

hinderling commented 8 months ago

I have issues with a Nikon reflector turret (TIFilterBlock1) timing out, which leads to kernel crashes. The setup is the following:

for i in range(20):  # multiple loops, as the timeout happens randomly
    for channel in channels:  # loop over the different configs
        mmc.setConfig('Channel', channel)
        mmc.waitForConfig('Channel', channel)  # or mmc.waitForSystem(); fails here with ~10% chance

It seems that in waitForSystem() timeout errors are propagated up the chain waitForDevice() -> waitForDeviceType() -> waitForSystem(),

while in waitForConfig() the errors are caught and only logged:

void CMMCore::waitForConfig(const char* group, const char* configName) throw (CMMError)
{
   CheckConfigGroupName(group);
   CheckConfigPresetName(configName);

   Configuration cfg = getConfigData(group, configName);
   try {
      for(size_t i=0; i<cfg.size(); i++)
         waitForDevice(cfg.getSetting(i).getDeviceLabel().c_str());
   } catch (CMMError& err) {
      // trap MM exceptions and keep quiet - this is not a good time to blow up
      logError("waitForConfig", err.getMsg().c_str());
   }
}

So not interrupting and just logging the error in waitForConfig() seems to be working as designed in MMCore (although this behaviour was unexpected to me). But could this be linked to the subsequent pymmcore crashes?
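
For what it's worth, a rough sketch of how I could surface the timeout myself by waiting on each device in the preset explicitly (assuming the SWIG-wrapped getConfigData()/getSetting()/waitForDevice() calls behave like their C++ counterparts shown above):

cfg = mmc.getConfigData('Channel', channel)    # Configuration object for the preset
for i in range(cfg.size()):
    device = cfg.getSetting(i).getDeviceLabel()
    mmc.waitForDevice(device)                  # a timeout here propagates as an exception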

marktsuchida commented 8 months ago

Hmm. waitForConfig swallowing the exception is definitely problematic (though we'd have to consider backward compatibility before changing it -- looks like it's been this way since before 2007).

However, it is not clear why discarding the exception would cause a subsequent crash. That seems like it could be a device adapter issue, though it is hard to tell with just this information. For example, it could be that the NikonTI device adapter crashes when a move is initiated before the previous move has completed.

Any chance you can reproduce this under conditions that can display a C++ backtrace? Perhaps the easiest is to run the equivalent BeanShell script in the MM application and look at the hs_err_pid*.log file that is generated after the crash. This will at least indicate which DLL the crash occurred in.

group = "Channel";
chans = mmc.getAvailableConfigs(group);
for (int i = 0; i < 20; i++) {
    for (int ch = 0; ch < chans.size(); ch++) {
        chan = chans.get(ch);
        mmc.setConfig(group, chan);
        mmc.waitForConfig(group, chan);
    }
}

Also, you can change the wait timeout from 5000 ms to something greater by setting the Core-TimeoutMs property. This might be a good workaround if the turret is just taking a little over 5 seconds to reach the position (and might be desirable even if we fix the crash).
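
For pymmcore users, the same workaround can be applied programmatically (a minimal sketch; 20000 ms is just an example value):

mmc.setProperty('Core', 'TimeoutMs', '20000')  # raise the wait timeout from the default 5000 ms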

hinderling commented 8 months ago

Thanks for the hints! You are right, the crash is not caused by swallowed exceptions; I was able to reproduce it with both mmc.waitForSystem() and mmc.waitForConfig(). In the end I was able to avoid the issue by removing the SutterLambda 10-3 Wheel-C from the config; somehow having Wheel-C and TIFilterBlock1 in the same config caused the crash on this microscope (but not if they were set individually). Also, a snapImage() call is needed for it to crash, and the script seems to crash on setConfig() after a snapImage() call.
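
For reference, a minimal pymmcore sketch of the failing sequence (the config file path is a placeholder; on this system the 'Channel' group includes both TIFilterBlock1 and Wheel-C settings):

import pymmcore

mmc = pymmcore.CMMCore()
# device adapter search paths are assumed to be set up already
mmc.loadSystemConfiguration('MMConfig_nikon.cfg')  # placeholder path for this system's config

for i in range(20):
    for channel in mmc.getAvailableConfigs('Channel'):
        mmc.snapImage()                        # a preceding snap seems to be required for the crash
        mmc.setConfig('Channel', channel)      # crashes here, after the snapImage() call
        mmc.waitForConfig('Channel', channel)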

There seem to be some timing issues involved.

Below are some crash logs from the BeanShell scripts: hs_err_pid4108.log, hs_err_pid8324.log, hs_err_pid9340.log

As I was able to reproduce the crash with the Micro-Manager Java GUI, this is certainly not a pymmcore issue. Feel free to close it. I can also open a new issue that just describes the unexpected error-swallowing behaviour of mmc.waitForConfig(), if that is something that should be changed.

marktsuchida commented 8 months ago

Interesting. It's unfortunate that part of the call stack is lacking symbols, but the fact that it goes through MSVCP100.dll (the C++ runtime for Visual Studio 2010) might be a hint, because most of MM no longer uses it (except for some device driver DLLs that we link to).

The JVM itself appears to use msvcr100.dll (the C runtime) but not MSVCP100.dll, based on what it ships. SutterLambda does not use MSVCP100 and I don't think NikonTI does, either. I'm not sure about AndorSDK3, whose DLLs I don't have at hand (you can check if you like).

Given that setConfig calls SetProperty on the devices, it might be useful to determine which device threw the uncaught C++ exception, since I get the feeling that your current workaround is only masking the error -- we wouldn't want it to come back.

Probably the fastest way is to run the test in MMStudio with the Problem Report (Help menu) capturing the log (its "crash-robust logging" will prevent the last few log entries from being lost, though it has a small chance of influencing timing-sensitive behavior). The report/log can be viewed when restarting MMStudio after the crash. If our guess that the exception comes from a device is correct, the log should end with a line saying "Will set property ..." for the device.
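
If reproducing from pymmcore instead, verbose Core logging can be enabled before running the test (a small sketch using the standard CMMCore logging calls; the log path is a placeholder):

mmc.setPrimaryLogFile('CoreLog_debug.txt')  # placeholder path for the Core log
mmc.enableDebugLog(True)
mmc.enableStderrLog(True)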


It would be great if you could create an issue for waitForConfig() swallowing exceptions in the mmCoreAndDevices repo. It's hard to say what we can do about it, because changing the behavior will almost certainly break existing code, but we can consider migration paths.