micro-manager / pycro-manager

Python control of micro-manager for customized data acquisition
https://pycro-manager.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

High speed / high data rate acquisition fails to complete #225

Closed dpshepherd closed 3 years ago

dpshepherd commented 3 years ago

Bug report

Bug summary

Pycromanager fails to complete acquisitions when running at high camera frame / data rates.

Code for reproduction

This uses the demo configuration in a manner similar to our stage scanning microscope, which we have previously discussed.

# imports
from pycromanager import Bridge, Acquisition
from pathlib import Path
import time

def camera_hook_fn(event,bridge,event_queue):

    core = bridge.get_core()

    return event

def post_hook_fn(event,bridge,event_queue):

    core = bridge.get_core()

    return event

def main():

    bridge = Bridge()
    core  = bridge.get_core()

    # create ROI similar to sCMOS for fast readout
    core.set_property('Camera','OnCameraCCDXSize',1600)
    core.set_property('Camera','OnCameraCCDYSize',256)
    core.set_property('Camera','FastImage',1)
    time.sleep(10)

    # define channels to use
    channel_states = [1,0,1,1]

    # define dimensions of scan
    number_y = 10
    number_z = 3
    number_scan = 2500

    # exposure time
    exposure_ms = 2.0

    # setup file name
    save_directory=Path('/path/to/data')
    save_name = 'test'

    # set exposure
    core.set_exposure(exposure_ms)

    # initialize position counter
    position = 0

    # create events to execute scan
    events = []

    # channel preset names, indexed to match the channel_states flags above
    channel_configs = ['DAPI', 'FITC', 'Rhodamine', 'Cy5']

    for y in range(number_y):
        for z in range(number_z):
            for c in range(len(channel_states)):
                for x in range(number_scan):
                    if channel_states[c]==1:
                        evt = { 'axes': {'z': x, 'position': position}, 'x': 0, 'y': y*100, 'z': z * 10,
                                'channel' : {'group': 'Channel', 'config': channel_configs[c]}}

                        events.append(evt)

            # advance position counter
            position = position + 1

    # run acquisition
    with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                    post_camera_hook_fn=camera_hook_fn, show_display=False, max_multi_res_index=0) as acq:
        acq.acquire(events)
        acq.acquire(None)
        acq.await_completion()

# run
if __name__ == "__main__":
    main()

Expected outcome

Pycromanager should finish the acquisition.

Actual outcome

The acquisition initializes successfully, but then CPU and RAM usage increase while hard disk usage decreases. Eventually the acquisition hangs or creates a corrupt dataset (perhaps related to issue #217?). The SSDs we are using are capable of sustained ~1 GB/s write rates, measured using CrystalDiskMark. The code as written requires <1.0 GB/s. We can reproduce the issue at ~0.3-0.6 GB/s by changing the exposure time or ROI of the demo camera. Once we are below ~0.2 GB/s, the acquisition completes successfully.
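
For a sense of scale, the sustained data rate of the script can be estimated from the ROI, pixel depth, and exposure time. A quick sketch of that arithmetic (assuming 16-bit pixels and a frame rate limited only by exposure):

# back-of-the-envelope data rate for the script above
# (assumes 16-bit pixels and an exposure-limited frame rate)
width, height = 1600, 256      # ROI set on the demo camera
bytes_per_pixel = 2            # 16-bit
exposure_ms = 2.0

frames_per_s = 1000.0 / exposure_ms                 # 500 fps
rate_gb_s = frames_per_s * width * height * bytes_per_pixel / 1e9
print(f'~{rate_gb_s:.2f} GB/s')                     # ~0.41 GB/s, i.e. <1.0 GB/s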

Version Info

henrypinkard commented 3 years ago

Yeah, I think you're probably right in thinking this is related to #217. In that case, we found that some of the 4GB files in the NDTiffStorage datasets were erroneously truncated to very small sizes (<100 MB), which suggests that this could be a problem with the data saving class. Do you see something similar? I think figuring out exactly how the datasets are corrupted might offer a way to track this down.

dpshepherd commented 3 years ago

No, we haven't seen that. In our case, we start the acquisition and the circular buffer is emptied quickly, and the write speed on the SSD is as expected. Then the buffer starts to fill and the write speed drops. This progressively gets worse until the acquisition hangs or finishes with a corrupt file. Usually it just hangs and stops running the acquisition; it is rare that it finishes and produces a corrupt file.

henrypinkard commented 3 years ago

Gotcha. I can try adding a debugging option to pull the data out of the circular buffer and simply discard it, to isolate whether this is related to the data saving.

We also saw that deleting all the data on the drive where saving takes place before starting the acquisition might have played a role. Also, running the acquisition with no display seemed to be essential (but I think you're probably already doing this).

dpshepherd commented 3 years ago

We tried cleaning the disk each time and it doesn't make a difference. We don't show the display.

Sounds like a good idea to isolate the circular buffer and test.

dpshepherd commented 3 years ago

Just did a quick check, and I think I can replicate similar behavior using the multi-dimensional acquisition in MM 2.0.

With the display off and writing to an "image stack file", the acquisition keeps up with the demo camera for a while and then suddenly starts filling the circular buffer, and the write speed drops. Interestingly, the write speed never recovers once the camera stops sending frames and the acquisition is just emptying the circular buffer.

One difference is that CPU usage does not go to 100% with native MM, unlike with pycromanager.

dpshepherd commented 3 years ago

I spent some time messing with the garbage collector, as suggested here.

I can get the native MDA in MM to finish higher data rate acquisitions using a newer OpenJDK and the ZGC garbage collector.

This doesn't quite carry over to pycromanager. For the pycromanager test I wrote above, if the circular buffer stays relatively empty, the acquisition creates a readable dataset. If images pile up in the buffer, pycromanager exits without an error (other than the ZMQ warning when using two hooks on Windows). When I try to load the dataset, I get the IFD error.

I am not a Java expert, so I'm not sure if switching to a newer OpenJDK JRE and messing with the garbage collector makes sense to do.
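
For context, a sketch of what that JVM swap looks like. On Windows, Micro-Manager's ImageJ launcher reads its JVM path and options from ImageJ.cfg (three lines: working directory, path to javaw.exe, then JVM arguments plus the main class). Assuming a JDK 15+ JRE, where ZGC no longer needs the experimental-options flag, the file might look like this (paths and heap size are placeholders):

.
C:\path\to\newer-jdk\bin\javaw.exe
-Xmx32g -XX:+UseZGC -cp ij.jar ij.ImageJ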

henrypinkard commented 3 years ago

Nice work. It sounds to me like there are two problems here: 1) NDTiffStorage somehow writes corrupted datasets when data rates are high (same as #217), and 2) possibly related, these backups in emptying the circular buffer.

I tried your demo code, but I don't think my laptop is up to recreating the fast data rate errors. By the way, you shouldn't need to call acq.acquire(None) and acq.await_completion(), as the with block handles this automatically.

I think the new garbage collector certainly seems reasonable for your use case. If that is indeed the limiting factor, we could ask Nico about upgrading the MM JRE.

In the meantime, you can test the source of (2) with the option to throw away data after acquiring. If you swap in NDTiffStorage 1.3.2, you can turn saving off with:

with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                 post_camera_hook_fn=camera_hook_fn, show_display=False, max_multi_res_index=0) as acq:
    acq._remote_acq.get_storage().discard_data_for_debugging()
    acq.acquire(events)
dpshepherd commented 3 years ago

I tried your demo code, but I don't think my laptop is up to recreating the fast data rate errors.

I've found that sometimes laptops are faster, I think it just depends on the specific hardware and drives.

By the way, you shouldn't need to call acq.acquire(None) and acq.await_completion(), as the with block handles this automatically.

Got it. I think we were playing with that because of the ZMQ error with two hooks a long time ago and never removed the code. We will get rid of it.

In the meantime, you can test the source of (2) with the option to throw away data after acquiring. If you swap in NDTiffStorage 1.3.2, you can turn saving off with:

with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                 post_camera_hook_fn=camera_hook_fn, show_display=False, max_multi_res_index=0) as acq:
    acq._remote_acq.get_storage().discard_data_for_debugging()
    acq.acquire(events)

Great! I should have time to test this tomorrow evening.

henrypinkard commented 3 years ago

Also, is this something that you think has been there all along, or did it crop up in a new version along the way?

dpshepherd commented 3 years ago

I think there are two separate issues:

  1. The circular buffer filling up during a high frame / data rate acquisition is a known issue with MM. MM writes so much metadata that it is hard to stream data at high rates. As long as enough RAM is allocated to the buffer that it does not overflow, MM successfully empties the circular buffer to a datastore (a sketch for enlarging the buffer follows this list).
  2. For pycromanager / NDTiffStorage, I'm not sure. We've seen corrupted NDTiffStorage files going back to when we first started using pycromanager. This popped up for us now because we re-aligned our microscope for more efficient photon collection, which means we can now use higher frame / data rates in these long acquisitions than we had tried before.
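
For item 1, a minimal sketch of enlarging the circular (sequence) buffer from Python before starting the acquisition. This assumes the MMCore methods setCircularBufferMemoryFootprint / getCircularBufferMemoryFootprint (sizes in MB) are reachable through the bridge's usual camelCase-to-snake_case translation:

from pycromanager import Bridge

bridge = Bridge()
core = bridge.get_core()

# allocate 8 GB to the circular buffer so brief write stalls do not overflow it
core.set_circular_buffer_memory_footprint(8192)
print(core.get_circular_buffer_memory_footprint())  # confirm the new size in MB
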
henrypinkard commented 3 years ago

Got it. My guess is then that (2) has probably been there all along, but only manifests at high speeds. This is definitely a high-priority thing to fix. While I'm at it, I should be able to address #220 as well, which I think will dramatically speed up opening Datasets.

henrypinkard commented 3 years ago

I just did a major refactor/cleanup of the code, which may solve this problem (or at least make it a lot easier to track down if not). Available in 1.4.0. Could you give it a try? If that works, I can add in the changes for #220, which will make big datasets with arbitrary axes much easier to open.

henrypinkard commented 3 years ago

@ieivanov any chance you could test NDTiffStorage 1.4.0 on your system as well and see if it fixes the failed writes at speeds >1.2 GB/s?

ieivanov commented 3 years ago

Yes, I'll take a look, I may have a chance to do that today. I'll let you know what I find.

dpshepherd commented 3 years ago

Testing results using pycromanager 0.8.0 and NDTiffStorage 1.5.0.

Testing code:

# imports
from pycromanager import Bridge, Acquisition
from pathlib import Path
import time

def camera_hook_fn(event,bridge,event_queue):

    core = bridge.get_core()

    return event

def post_hook_fn(event,bridge,event_queue):

    core = bridge.get_core()

    return event

def main():

    debug_circular_buffer=True

    bridge = Bridge()
    core  = bridge.get_core()

    # create ROI similar to sCMOS for fast readout
    core.set_property('Camera','OnCameraCCDXSize',1600)
    core.set_property('Camera','OnCameraCCDYSize',512)
    core.set_property('Camera','FastImage',1)
    time.sleep(10)

    # define channels to use
    channel_states = [1,0,1,1]

    # define dimensions of scan
    number_y = 10
    number_z = 3
    number_scan = 2500

    # exposure time
    exposure_ms = 2.0

    # setup file name
    save_directory=Path('/path/to/data')
    save_name = 'test'

    # set exposure
    core.set_exposure(exposure_ms)

    # initialize position counter
    position = 0

    # create events to execute scan
    events = []

    # channel preset names, indexed to match the channel_states flags above
    channel_configs = ['DAPI', 'FITC', 'Rhodamine', 'Cy5']

    for y in range(number_y):
        for z in range(number_z):
            for c in range(len(channel_states)):
                for x in range(number_scan):
                    if channel_states[c]==1:
                        evt = { 'axes': {'z': x, 'position': position}, 'x': 0, 'y': y*100, 'z': z * 10,
                                'channel' : {'group': 'Channel', 'config': channel_configs[c]}}

                        events.append(evt)

            # advance position counter
            position = position + 1

    # run acquisition
    if debug_circular_buffer:
        with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                         post_camera_hook_fn=camera_hook_fn, show_display=False, max_multi_res_index=0) as acq:
            acq._remote_acq.get_storage().discard_data_for_debugging()
            acq.acquire(events)
    else:
        with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                         post_camera_hook_fn=camera_hook_fn, show_display=False, max_multi_res_index=0) as acq:
            acq.acquire(events)

# run
if __name__ == "__main__":
    main()

Do not save data (debug_circular_buffer=True): The Acquisition block completes. CPU usage hovers around ~20% and RAM usage does not increase beyond the initial allocations for the circular buffer and ImageJ.

Saving data (debug_circular_buffer=False): The Acquisition block makes it farther than before, but CPU usage still eventually spikes to ~100%, RAM usage goes to the system max (~64 GB), and then the acquisition hangs.

dpshepherd commented 3 years ago

Another observation. If I lower the data rate, I can get a situation where CPU, RAM, and disk utilization all vary up and down.

The Python code block for the Acquisition finishes if the CPU usage does not spike to 100%. Interestingly, the hard disk stays active after the Acquisition block returns. At that time, MM shows images remaining in the buffer. If I wait until the disk usage drops to 0%, I am able to load the dataset correctly using Dataset.

henrypinkard commented 3 years ago

At that time, MM shows images remaining in the buffer. If I wait until the disk usage drops to 0%, I am able to load the dataset correctly using Dataset.

Okay, I cleaned up the finishing behavior so that the Python with statement should block until all the writing is finished.

Saving data (debug_circular_buffer=False): The Acquisition block makes it farther than before, but CPU usage still eventually spikes to ~100%, RAM usage goes to the system max (~64 GB), and then the acquisition hangs.

That suggests something is going wrong in the NDTiffStorage writing code. I added a bunch of logging to try to figure out where this is happening. If you use Acquisition(debug=True), it should all show up in the Micro-Manager core log. If you can monitor it and figure out exactly when the CPU spikes to 100% (there are timestamps in the core log), then maybe I can pinpoint the problem.

To do this, update to pycromanager 0.9.0, and drop in NDTiff 1.6.1, AcqEngJ 0.9.0, and PycromanagerJava 0.14.1.
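
For reference, a minimal sketch of turning the logging on in the script above (all parameters other than debug are from that script):

# enable acquisition debug logging; the extra messages land in the
# Micro-Manager core log, timestamped alongside the normal core entries
with Acquisition(directory=save_directory, name=save_name, post_hardware_hook_fn=post_hook_fn,
                 post_camera_hook_fn=camera_hook_fn, show_display=False,
                 max_multi_res_index=0, debug=True) as acq:
    acq.acquire(events)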

dpshepherd commented 3 years ago

This error shows up in the core log when the spikes occur:

2021-01-29T14:29:56.859491 tid20108 [IFO,App] EDTHangLogger: Missed heartbeat; waiting to see if we are stuck on a single event
2021-01-29T14:29:56.860489 tid21256 [IFO,App] Finished writing image
2021-01-29T14:29:58.438277 tid21256 [IFO,App] Writing image tczp: 0  1 107 0
2021-01-29T14:29:58.995789 tid21256 [IFO,App] Finished writing image
2021-01-29T14:29:58.996787 tid21256 [IFO,App] Writing image tczp: 0  1 108 0
2021-01-29T14:30:00.538672 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:01.998774 tid21256 [IFO,App] Writing image tczp: 0  1 109 0
2021-01-29T14:30:01.998774 tid20108 [IFO,App] EDTHangLogger: Event handling has exceeded at least 2136 ms (currently 6937 ms since heartbeat timebase)
                                    [       ] Stack traces follow (note: thread states queried later than stack traces)
                                    [       ] Thread 2 [Reference Handler] RUNNABLE
                                    [       ]   at java.lang.Object.notifyAll(Native Method)
                                    [       ]   at java.lang.ref.ReferenceQueue.enqueue(ReferenceQueue.java:73)
                                    [       ]   at java.lang.ref.Reference.tryHandlePending(Reference.java:217)
                                    [       ]   at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
                                    [       ] Thread 3 [Finalizer] RUNNABLE
                                    [       ]   at mmcorej.MMCoreJJNI.delete_PropertySetting(Native Method)
                                    [       ]   at mmcorej.PropertySetting.delete(PropertySetting.java:32)
                                    [       ]   at mmcorej.PropertySetting.finalize(PropertySetting.java:25)
                                    [       ]   at java.lang.System$2.invokeFinalize(System.java:1270)
                                    [       ]   at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
                                    [       ]   at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
                                    [       ]   at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
                                    [       ] Thread 4 [Signal Dispatcher] RUNNABLE
                                    [       ] Thread 5 [Attach Listener] RUNNABLE
                                    [       ] Thread 12 [RMI TCP Accept-0] RUNNABLE
                                    [       ]   at java.net.DualStackPlainSocketImpl.accept0(Native Method)
                                    [       ]   at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
                                    [       ]   at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
                                    [       ]   at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
                                    [       ]   at java.net.ServerSocket.implAccept(ServerSocket.java:545)
                                    [       ]   at java.net.ServerSocket.accept(ServerSocket.java:513)
                                    [       ]   at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:400)
                                    [       ]   at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:372)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 13 [RMI Reaper] WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
                                    [       ]   at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
                                    [       ]   at sun.rmi.transport.ObjectTable$Reaper.run(ObjectTable.java:351)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 14 [GC Daemon] TIMED_WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at sun.misc.GC$Daemon.run(GC.java:117)
                                    [       ] Thread 15 [Java2D Disposer] RUNNABLE
                                    [       ]   at sun.java2d.d3d.D3DRenderQueue.flushBuffer(Native Method)
                                    [       ]   at sun.java2d.d3d.D3DRenderQueue.flushBuffer(D3DRenderQueue.java:152)
                                    [       ]   at sun.java2d.d3d.D3DRenderQueue.flushNow(D3DRenderQueue.java:137)
                                    [       ]   at sun.java2d.d3d.D3DSurfaceData.dispose(D3DSurfaceData.java:727)
                                    [       ]   at sun.java2d.DefaultDisposerRecord.invokeNativeDispose(Native Method)
                                    [       ]   at sun.java2d.DefaultDisposerRecord.dispose(DefaultDisposerRecord.java:42)
                                    [       ]   at sun.java2d.Disposer.clearDeferredRecords(Disposer.java:179)
                                    [       ]   at sun.java2d.Disposer.run(Disposer.java:154)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 16 [AWT-Shutdown] WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at java.lang.Object.wait(Object.java:502)
                                    [       ]   at sun.awt.AWTAutoShutdown.run(AWTAutoShutdown.java:295)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 17 [AWT-Windows] RUNNABLE
                                    [       ]   at sun.awt.windows.WToolkit.eventLoop(Native Method)
                                    [       ]   at sun.awt.windows.WToolkit.run(WToolkit.java:306)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 19 [AWT-EventQueue-0] WAITING
                                    [       ]   at sun.misc.Unsafe.park(Native Method)
                                    [       ]   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
                                    [       ]   at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
                                    [       ]   at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
                                    [       ]   at sun.awt.SunToolkit.awtLock(SunToolkit.java:253)
                                    [       ]   at sun.java2d.pipe.RenderQueue.lock(RenderQueue.java:112)
                                    [       ]   at sun.java2d.pipe.BufferedRenderPipe.fillRect(BufferedRenderPipe.java:125)
                                    [       ]   at sun.java2d.pipe.ValidatePipe.fillRect(ValidatePipe.java:76)
                                    [       ]   at sun.java2d.SunGraphics2D.fillRect(SunGraphics2D.java:2423)
                                    [       ]   at javax.swing.plaf.ComponentUI.update(ComponentUI.java:159)
                                    [       ]   at javax.swing.JComponent.paintComponent(JComponent.java:780)
                                    [       ]   at javax.swing.JComponent.paint(JComponent.java:1056)
                                    [       ]   at javax.swing.JComponent.paintToOffscreen(JComponent.java:5210)
                                    [       ]   at javax.swing.RepaintManager$PaintManager.paintDoubleBuffered(RepaintManager.java:1579)
                                    [       ]   at javax.swing.RepaintManager$PaintManager.paint(RepaintManager.java:1502)
                                    [       ]   at javax.swing.RepaintManager.paint(RepaintManager.java:1272)
                                    [       ]   at javax.swing.JComponent._paintImmediately(JComponent.java:5158)
                                    [       ]   at javax.swing.JComponent.paintImmediately(JComponent.java:4969)
                                    [       ]   at javax.swing.RepaintManager$4.run(RepaintManager.java:831)
                                    [       ]   at javax.swing.RepaintManager$4.run(RepaintManager.java:814)
                                    [       ]   at java.security.AccessController.doPrivileged(Native Method)
                                    [       ]   at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:80)
                                    [       ]   at javax.swing.RepaintManager.paintDirtyRegions(RepaintManager.java:814)
                                    [       ]   at javax.swing.RepaintManager.paintDirtyRegions(RepaintManager.java:789)
                                    [       ]   at javax.swing.RepaintManager.prePaintDirtyRegions(RepaintManager.java:738)
                                    [       ]   at javax.swing.RepaintManager.access$1200(RepaintManager.java:64)
                                    [       ]   at javax.swing.RepaintManager$ProcessingRunnable.run(RepaintManager.java:1732)
                                    [       ]   at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:311)
                                    [       ]   at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:756)
                                    [       ]   at java.awt.EventQueue.access$500(EventQueue.java:97)
                                    [       ]   at java.awt.EventQueue$3.run(EventQueue.java:709)
                                    [       ]   at java.awt.EventQueue$3.run(EventQueue.java:703)
                                    [       ]   at java.security.AccessController.doPrivileged(Native Method)
                                    [       ]   at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:80)
                                    [       ]   at java.awt.EventQueue.dispatchEvent(EventQueue.java:726)
                                    [       ]   at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:201)
                                    [       ]   at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
                                    [       ]   at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
                                    [       ]   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
                                    [       ]   at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
                                    [       ]   at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
                                    [       ] Thread 22 [D3D Screen Updater] TIMED_WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at sun.java2d.d3d.D3DScreenUpdateManager.run(D3DScreenUpdateManager.java:423)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 23 [DestroyJavaVM] RUNNABLE
                                    [       ] Thread 24 [User Profile Saver Pool Thread 0] WAITING
                                    [       ]   at sun.misc.Unsafe.park(Native Method)
                                    [       ]   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
                                    [       ]   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
                                    [       ]   at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 27 [TimerQueue] TIMED_WAITING
                                    [       ]   at sun.misc.Unsafe.park(Native Method)
                                    [       ]   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
                                    [       ]   at java.util.concurrent.DelayQueue.take(DelayQueue.java:223)
                                    [       ]   at javax.swing.TimerQueue.run(TimerQueue.java:174)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 28 [EDTHangLogger timer] RUNNABLE
                                    [       ]   at java.lang.Thread.dumpThreads(Native Method)
                                    [       ]   at java.lang.Thread.getAllStackTraces(Thread.java:1610)
                                    [       ]   at org.micromanager.internal.diagnostics.EDTHangLogger.checkForHang(EDTHangLogger.java:233)
                                    [       ]   at org.micromanager.internal.diagnostics.EDTHangLogger.access$200(EDTHangLogger.java:34)
                                    [       ]   at org.micromanager.internal.diagnostics.EDTHangLogger$4.run(EDTHangLogger.java:220)
                                    [       ]   at java.util.TimerThread.mainLoop(Timer.java:555)
                                    [       ]   at java.util.TimerThread.run(Timer.java:505)
                                    [       ] Thread 29 [Thread-4] TIMED_WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at bsh.util.JConsole$BlockingPipedInputStream.read(JConsole.java:771)
                                    [       ]   at java.io.PipedInputStream.read(PipedInputStream.java:377)
                                    [       ]   at java.io.InputStream.read(InputStream.java:101)
                                    [       ]   at bsh.util.JConsole.inPipeWatcher(JConsole.java:667)
                                    [       ]   at bsh.util.JConsole.run(JConsole.java:677)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 30 [BeanShell interpreter] TIMED_WAITING
                                    [       ]   at java.lang.Object.wait(Native Method)
                                    [       ]   at java.io.PipedInputStream.read(PipedInputStream.java:326)
                                    [       ]   at java.io.PipedInputStream.read(PipedInputStream.java:377)
                                    [       ]   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
                                    [       ]   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
                                    [       ]   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
                                    [       ]   at java.io.InputStreamReader.read(InputStreamReader.java:184)
                                    [       ]   at bsh.JavaCharStream.FillBuff(JavaCharStream.java:160)
                                    [       ]   at bsh.JavaCharStream.ReadByte(JavaCharStream.java:188)
                                    [       ]   at bsh.JavaCharStream.readChar(JavaCharStream.java:289)
                                    [       ]   at bsh.JavaCharStream.BeginToken(JavaCharStream.java:209)
                                    [       ]   at bsh.ParserTokenManager.getNextToken(ParserTokenManager.java:2072)
                                    [       ]   at bsh.Parser.jj_ntk(Parser.java:6022)
                                    [       ]   at bsh.Parser.Line(Parser.java:163)
                                    [       ]   at bsh.Interpreter.Line(Interpreter.java:1011)
                                    [       ]   at bsh.Interpreter.run(Interpreter.java:462)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 32 [ZMQ Server ] RUNNABLE
                                    [       ]   at java.io.FileOutputStream.writeBytes(Native Method)
                                    [       ]   at java.io.FileOutputStream.write(FileOutputStream.java:326)
                                    [       ]   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
                                    [       ]   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
                                    [       ]   at java.io.PrintStream.write(PrintStream.java:480)
                                    [       ]   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
                                    [       ]   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
                                    [       ]   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
                                    [       ]   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
                                    [       ]   at java.io.PrintStream.write(PrintStream.java:527)
                                    [       ]   at java.io.PrintStream.print(PrintStream.java:669)
                                    [       ]   at java.io.PrintStream.println(PrintStream.java:806)
                                    [       ]   at org.micromanager.internal.zmq.ZMQServer.lambda$initialize$1(ZMQServer.java:123)
                                    [       ]   at org.micromanager.internal.zmq.ZMQServer$$Lambda$121/860558540.call(Unknown Source)
                                    [       ]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 33 [reaper-1] RUNNABLE
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:296)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:278)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:159)
                                    [       ]   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
                                    [       ]   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
                                    [       ]   at zmq.poll.Poller.run(Poller.java:234)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 34 [iothread-2] RUNNABLE
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:296)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:278)
                                    [       ]   at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:159)
                                    [       ]   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
                                    [       ]   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
                                    [       ]   at zmq.poll.Poller.run(Poller.java:234)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 36 [pool-1-thread-1] RUNNABLE
                                    [       ]   at java.nio.Bits.copyFromArray(Bits.java:808)
                                    [       ]   at java.nio.DirectShortBufferU.put(DirectShortBufferU.java:373)
                                    [       ]   at java.nio.ShortBuffer.put(ShortBuffer.java:859)
                                    [       ]   at org.micromanager.multiresstorage.MultipageTiffWriter.getPixelBuffer(MultipageTiffWriter.java:531)
                                    [       ]   at org.micromanager.multiresstorage.MultipageTiffWriter.writeIFD(MultipageTiffWriter.java:452)
                                    [       ]   at org.micromanager.multiresstorage.MultipageTiffWriter.writeImage(MultipageTiffWriter.java:309)
                                    [       ]   at org.micromanager.multiresstorage.ResolutionLevel$FileSet.writeImage(ResolutionLevel.java:544)
                                    [       ]   at org.micromanager.multiresstorage.ResolutionLevel.putImage(ResolutionLevel.java:233)
                                    [       ]   at org.micromanager.multiresstorage.MultiResMultipageTiffStorage$3.run(MultiResMultipageTiffStorage.java:970)
                                    [       ]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                                    [       ]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 37 [Acquisition image processing and saving thread] WAITING
                                    [       ]   at sun.misc.Unsafe.park(Native Method)
                                    [       ]   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
                                    [       ]   at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
                                    [       ]   at org.micromanager.acqj.api.Acquisition.lambda$start$1(Acquisition.java:138)
                                    [       ]   at org.micromanager.acqj.api.Acquisition$$Lambda$154/1240380904.run(Unknown Source)
                                    [       ]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                                    [       ]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 38 [Acq Eng event generator] RUNNABLE
                                    [       ]   at java.io.FileOutputStream.writeBytes(Native Method)
                                    [       ]   at java.io.FileOutputStream.write(FileOutputStream.java:326)
                                    [       ]   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
                                    [       ]   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
                                    [       ]   at java.io.PrintStream.write(PrintStream.java:480)
                                    [       ]   at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
                                    [       ]   at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
                                    [       ]   at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
                                    [       ]   at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
                                    [       ]   at java.io.PrintStream.newLine(PrintStream.java:546)
                                    [       ]   at java.io.PrintStream.println(PrintStream.java:824)
                                    [       ]   at java.lang.Throwable$WrappedPrintStream.println(Throwable.java:748)
                                    [       ]   at java.lang.Throwable.printStackTrace(Throwable.java:658)
                                    [       ]   at java.lang.Throwable.printStackTrace(Throwable.java:643)
                                    [       ]   at java.lang.Throwable.printStackTrace(Throwable.java:634)
                                    [       ]   at org.micromanager.acqj.internal.acqengj.Engine.lambda$submitEventIterator$4(Engine.java:157)
                                    [       ]   at org.micromanager.acqj.internal.acqengj.Engine$$Lambda$156/674043949.run(Unknown Source)
                                    [       ]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                                    [       ]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
                                    [       ] Thread 39 [Acquisition Engine Thread] WAITING
                                    [       ]   at sun.misc.Unsafe.park(Native Method)
                                    [       ]   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                                    [       ]   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
                                    [       ]   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
                                    [       ]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                                    [       ]   at java.lang.Thread.run(Thread.java:748)
2021-01-29T14:30:06.454882 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:08.093510 tid21256 [IFO,App] Writing image tczp: 0  1 110 0
2021-01-29T14:30:08.095504 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:09.485793 tid21256 [IFO,App] Writing image tczp: 0  1 111 0
2021-01-29T14:30:10.144037 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:10.144037 tid21256 [IFO,App] Writing image tczp: 0  1 112 0
2021-01-29T14:30:10.145034 tid17256 [IFO,App] Finished writing. Remaining writing task queue size = 6089
2021-01-29T14:30:10.146032 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:10.146032 tid21256 [IFO,App] Writing image tczp: 0  1 113 0
2021-01-29T14:30:10.148027 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:11.560257 tid21256 [IFO,App] Creating new tiff file: tczp: 0  1 114 0
2021-01-29T14:30:14.579200 tid21256 [IFO,App] finished existing file
2021-01-29T14:30:14.579200 tid21256 [IFO,App] Constructed new file
2021-01-29T14:30:14.579200 tid21256 [IFO,App] Writing image tczp: 0  1 114 0
2021-01-29T14:30:14.580198 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:14.580198 tid21256 [IFO,App] Writing image tczp: 0  1 115 0
2021-01-29T14:30:14.582193 tid21256 [IFO,App] Finished writing image
2021-01-29T14:30:15.956524 tid21256 [IFO,App] Writing image tczp: 0  1 116 0
2021-01-29T14:30:15.957521 tid18400 [IFO,App] EDTHangLogger: First heartbeat after miss (23898 ms since timebase)

I'm testing a bit more, because it seems like we are getting complete datasets now, even when the spikes happen.

dpshepherd commented 3 years ago

Ok. I think if you overflow the circular buffer, you end up with files that are not the correct size. However, this is not noted in the core log.

These are the final lines of a core log from a run where the buffer overflowed due to extremely high data rates (~2.5 GB/s). This dataset should have shape (2500, 10, 3, 2048, 2048). Instead, Dataset returns (2500, 2, 2048, 2048) when I open it and convert to a Dask array.

2021-01-29T14:32:52.849983 tid21256 [IFO,App] Writing image tczp: 0  0 1201 1
2021-01-29T14:32:52.866021 tid21256 [IFO,App] Finished writing image
2021-01-29T14:32:52.906285 tid17256 [IFO,App] Writing executor complete
2021-01-29T14:47:35.961274 tid4132 [IFO,dev:Camera] Sequence thread exiting
2021-01-29T14:48:01.097267 tid9916 [IFO,dev:Camera] Sequence thread exiting

There isn't an error associated with it; it just does not finish writing the images (the index should go up to 2499). If you scroll back up, you can see that the previous set of axes did reach 2499.
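
For anyone reproducing this, a sketch of the check described above: open the dataset and inspect the shape of the lazily loaded array (the path is a placeholder, and this assumes Dataset.as_array() returns a Dask array, as in recent pycromanager versions):

from pycromanager import Dataset

# open the (possibly truncated) dataset and convert it to a Dask array
dataset = Dataset('/path/to/data/test_1')
data = dataset.as_array()

# expected (2500, 10, 3, 2048, 2048); the failed run returned (2500, 2, 2048, 2048)
print(data.shape)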

Is pycromanager pulling from the circular buffer into Java memory and then writing the file? If so, maybe upping the ImageJ memory allocation will help?

henrypinkard commented 3 years ago

The previous error is an internal thing in Micro-Manager that is designed to make sure the GUI remains responsive; if it finds that the GUI hasn't responded for too long, it dumps the status of every thread. I'm not sure why this would be happening given that you're not even using the GUI. Maybe there's some accidental crosstalk (which I would be surprised by), or the acquisition eats up so many computational resources that it crowds out GUI updates (even though there's nothing really to update).

henrypinkard commented 3 years ago

Is pycromanager pulling from the circular buffer into Java memory and then writing the file? If so, maybe upping the ImageJ memory allocation will help?

Yes, the acquisition engine does this and then dispatches to NDTiffStorage. This would be limited by ImageJ memory (i.e. Java virtual machine memory), but I believe there's a bounded queue that limits the number of images that can be held in the Java layer at once. It's set to an arbitrary value right now, but I'll add an option to be able to play with this.

henrypinkard commented 3 years ago

I'm not sure if the core API has a mechanism to detect overflows of the circular buffer and raise an error. It might be that it's just the calling code's responsibility to pull out images fast enough. @nicost, do you know if this is true?

henrypinkard commented 3 years ago

The previous error is an internal thing in Micro-Manager that is designed to make sure the GUI remains responsive; if it finds that the GUI hasn't responded for too long, it dumps the status of every thread. I'm not sure why this would be happening given that you're not even using the GUI. Maybe there's some accidental crosstalk (which I would be surprised by), or the acquisition eats up so many computational resources that it crowds out GUI updates (even though there's nothing really to update).

Actually, I think I know what is causing this. The saving class was attempting to show a progress bar when writing finishes. I took this out and made a new NDTiff, 1.6.2. Maybe this will solve those CPU spikes?

henrypinkard commented 3 years ago

Yes, the acquisition engine does this and then dispatches to NDTiffStorage. This would be limited by ImageJ memory (i.e. Java virtual machine memory), but I believe there's a bounded queue that limits the number of images that can be held in the Java layer at once. It's set to an arbitrary value right now, but I'll add an option to be able to play with this.

Actually, it appears that this is not the limiting factor, because,

2021-01-29T14:30:10.145034 tid17256 [IFO,App] Finished writing. Remaining writing task queue size = 6089

This means that all images have been pulled from the circular buffer, no more are being acquired, and there are 6089 images in RAM (in Java) waiting to be written to disk.

Once you've tried that new version, it would be helpful if you could send me a representative core log from a run where problems occur so I can dig through it.

dpshepherd commented 3 years ago

Here is a core log from a failed run. You can see that the writing keeps up with the image acquisition for a while; then something happens, images start piling up in the writing task queue, and the acquisition ultimately fails, writing out whatever images are left in the queue. Failed_run_core_log.zip

henrypinkard commented 3 years ago

It looks to me like the writing can't quite keep up with the rate at which data is generated, but this is masked early on, maybe by your hard drive being able to absorb a few hundred images' worth of data into its write buffer. I've seen this before, where the write speed appears artificially fast at first simply because some low-level code in the hard drive caches data before actually writing it out.

Seems like the writing can catch up by 300 or so images in between sequences acquiring new data, and this ability to catch up is maintained even as images pile up in the writing task queue. What do your hard drive and RAM usage look like during all of this?

Was this at the 2.5 GB/s rate?

It's also not clear to me why the acquisition fails. It seems like images are pulled out of the core fast enough, so I would think that even though the writing queue balloons in size, this shouldn't be a problem so long as you have the RAM to absorb it. Any thoughts?

dpshepherd commented 3 years ago

The hard drive is not being used at all at first, and RAM usage increases. Then the hard drive write speed goes above the rate at which data is being generated, until it catches up (I think). Then it settles into the correct write speed at ~20% CPU usage, until all of a sudden the write speed drops and CPU/RAM usage goes to max. Then the acquisition eventually fails.

This is at ~300 MB/s. If I run this same acquisition in MM by setting the requested timepoints to 75000 with no delay in between, it completes no problem. I can also complete it using a physical camera acquiring the same number of images with the same ROI (1600x256).

dpshepherd commented 3 years ago

We just completed a ~500 MB/s run of 75,000 images using native MM and using pycromanager with Acquisition(debug=False) in the above script. Messing with different settings some more, I think you are right that the acquisition fails when the hard disk cannot keep up and there is not enough RAM to absorb the backlog.

I still don't understand why we sometimes see this behavior at low data rates.

Maybe it would be useful to check whether the expected number of images were returned to the Acquisition process and throw an error if not?

henrypinkard commented 3 years ago

I did a lot of digging on this, and with your script I think I was able to reproduce behavior similar to what you've been describing. I traced the problem all the way down to inconsistent timing, including some long delays, in the Java call that actually writes buffers of pixel data to disk.

[Screenshot: Screen Shot 2021-01-29 at 8 19 36 PM]

While doing this, and in light of the other stress testing of data performance in this and other issues, I decided it was worthwhile to undertake a major refactor of NDTiff to optimize it for reading and writing performance, make it easier to benchmark and debug, and fix a bunch of expedient hacks I had made along the way when adding new features. I should have a new version ready for testing shortly.

ieivanov commented 3 years ago

I just had another read through this issue. Let me know if I can help with any testing or double-check any of the issues you're seeing. I haven't pushed write speeds recently, so I don't have much first-hand experience with this.

henrypinkard commented 3 years ago

Okay, I finished making a ton of changes under the hood, mostly based around creating a new version of NDTiffStorage. I cleaned up a lot of old code to make it easier to work with, and made a new version of the format with higher performance in mind. Specifically, it now writes a binary file called NDTiff.index with each dataset that contains byte offsets for image pixels and metadata, so the files can be read really quickly (I verified this with the new version of Dataset, which now opens lightning fast). There was an older, hacky version of this in the previous NDTiff, which had some additional overhead and also required switching between writing data at the beginning and at the end of the file, which I think may have impacted performance.

Installing requires updating NDTiffStorage, NDViewer, AcqEngJ, and PycromanagerJava to the latest versions here, along with the pip-installed pycromanager. I also put in a PR (https://github.com/micro-manager/micro-manager/pull/1096), so these should be in the nightlies soon.

Taking a closer look at the performance issues on my laptop, I can see a clear bottleneck in the actual writing of data to disk. As a result, there is usually a backup of images waiting to be written in the saving queue. Transferring images out of the sequence buffer is comparatively fast, so as long as there is enough space in the saving queue to hold the images, the sequence buffer should easily be able to drain. There is now a limit on the number of images that can be held in that queue, which you can control using Acquisition(saving_queue_size=). If that queue is at maximum capacity, the sequence buffer won't be able to empty. However, this should not lead to an error: emptying the sequence buffer occurs on the same thread that dispatches sequences to the hardware, so writing that can't keep up will cause the saving queue to fill, which will in turn slow down the acquisition of more data from hardware.

This all assumes there is enough memory in the sequence buffer and the Java layer, and enough capacity in the saving queue, to hold at least one full sequence worth of images. My expectation is that adding more Java memory and saving queue capacity should improve overall performance, since multiple sequences can be queued up waiting to write and the writing can occur continuously. The sequence buffer shouldn't need to be any bigger than one full sequence worth of images, since it should never hold more than that. @ieivanov, I recall in our previous testing that we saw the sequence buffer fill up and overflow, which would only make sense if a single sequence was larger than the buffer. Do you remember the specifics?

Any testing either of you could do would be great, and it would also be good to check that nothing else broke as a side effect. There are still many degrees of freedom to experiment with in the low-level writing code (e.g. memory mapping, waiting until the acquisition ends to write the index, etc.), but it is probably good to start with these as they are, to establish a baseline and make sure nothing else is going wrong. I was able to consistently get ~600-700 MB/s write speeds (averaged over the acquisition) on my laptop using @dpshepherd's demo cam script.
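
A sketch of the new parameter in use (the value of 500 images is an arbitrary illustration, not a recommendation; the other parameters are from the script above):

# let the saving queue absorb bursts of images while the disk catches up
with Acquisition(directory=save_directory, name=save_name,
                 show_display=False, saving_queue_size=500) as acq:
    acq.acquire(events)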

dpshepherd commented 3 years ago

Wow! Thanks for the work on this, sounds like a major effort. Much appreciated.

Does sequence buffer = the circular buffer in Micromanager?

> Taking a closer look at the performance issues on my laptop, I can see a clear bottleneck in the actual writing of data to disk. As a result, there is usually a backup of images waiting to write in the saving queue. Transferring images out of the sequence buffer is comparatively really fast, so as long as there is enough space to hold the images in the saving queue, the sequence buffer should easily be able to drain. There is now a limit on the number of images that can be held in that queue, which you can control using Acquisition(saving_queue_size=). If that queue is at maximum capacity, then the sequence buffer won't be able to empty. However, this should not lead to an error, because emptying the sequence buffer occurs on the same thread that dispatches sequences to the hardware, so writing that can't keep up will lead to the saving queue filling up, which will in turn slow down the acquisition of more data from hardware.

How will this work with hardware-triggered acquisitions where the computer is just reading data from the camera and not sending hardware commands? It seems like it should fail if the hardware acquisition produces images faster than the saving queue can empty to disk.

> This is all assuming there is enough memory in the sequence buffer and Java layer, and enough capacity in the saving queue to hold at least one full sequence's worth of images. My expectation is that adding more memory in Java and more saving queue capacity should improve overall performance, since multiple sequences can be queued up waiting to write and the writing can occur continuously. The sequence buffer shouldn't need to be any bigger than one full sequence's worth of images, since it should never hold more than that. @ieivanov, I recall in our previous testing that we saw the sequence buffer fill up and overflow, which would only make sense if the size of a single sequence was larger than the size of the buffer. Do you remember specifics on that?

For externally hardware-controlled acquisitions, it will be one big sequence, so the sequence buffer will be much smaller than the total number of images in the acquisition, and the expectation is that the whole system can keep up with the data rate. The user needs to verify their disk can keep up with the data rate from the camera. The above paragraph made me curious: is there an expectation in your approach that image generation will pause at some point?

> Any testing of this either of you could do would be great, and it would also be good to check that nothing else was broken as a side effect. There are still many degrees of freedom to experiment with in the low-level writing code (e.g. memory mapping, waiting until the acquisition ends to write the index, etc.), but it's probably good to start with these as they are to establish a baseline and make sure nothing else is going wrong. I was able to consistently get ~600-700 MB/s write speeds (averaged over the acquisition) on my laptop using @dpshepherd's demo cam script.

I'll be able to do some testing later this week and over the weekend.

henrypinkard commented 3 years ago

> Wow! Thanks for the work on this, sounds like a major effort. Much appreciated.

Thank you for the extensive testing and finding all the weaknesses of the previous system :)

> Does sequence buffer = the circular buffer in Micromanager?

Yes

> How will this work with hardware-triggered acquisitions where the computer is just reading data from the camera and not sending hardware commands? It seems like it should fail if the hardware acquisition produces images faster than the saving queue can empty to disk.

That's a good point, I guess I need to check into that. Is that what your system does now? I got the impression from your demo script that sequences were only running over each scan.

> The above paragraph made me curious: is there an expectation in your approach that image generation will pause at some point?

I don't think so -- I think I just misunderstood your setup. It all comes down to the post_camera_hook: if it's only called once, it's one big sequence, but if it's called multiple times, it's multiple sequences.
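If it helps while testing, one way to see which regime you're in is to count how many times the hook fires (a sketch, assuming the three-argument hook signature):

sequence_count = 0

def camera_hook_fn(event, bridge, event_queue):
    # fires once per hardware sequence the acquisition engine dispatches:
    # one call means one big sequence, many calls mean many short sequences
    global sequence_count
    sequence_count += 1
    print(f'camera hook call #{sequence_count}')
    return event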

dpshepherd commented 3 years ago

> How will this work with hardware-triggered acquisitions where the computer is just reading data from the camera and not sending hardware commands? It seems like it should fail if the hardware acquisition produces images faster than the saving queue can empty to disk.

> That's a good point, I guess I need to check into that. Is that what your system does now? I got the impression from your demo script that sequences were only running over each scan.

We do both the "one big" acquisition and ones that are split into sequences that run over each scan. Both are common approaches, which is why I brought it up.

henrypinkard commented 3 years ago

Gotcha. I'll take a closer look at the "one big" scenario when I have a chance. In the meantime, the split ones are a good starting place for testing.

ieivanov commented 3 years ago

Thanks Henry! I'll give this a try soon.

The majority of my acquisitions are hardware-sequenced and larger than any amount of memory I can allocate, so I usually need to make sure that I'm operating in a regime where writing to disk is faster than the input data stream. Sometimes I work in the space where data writing is just a bit slower than the data input rate. In that case the memory buffer starts filling up slowly, and the total acquisition duration is limited by the size of the memory buffer I can allocate. It would be ideal if the acquisition fails safely if the memory buffer fills up, i.e. the saved dataset is not corrupted and data written up to that point can still be accessed.
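For concreteness, that regime looks something like this (rates are illustrative):

# when writing is slightly slower than the input stream, the buffer fills
# at the difference of the two rates, which caps the acquisition duration
input_rate_mb_s = 600    # camera data rate (illustrative)
write_rate_mb_s = 500    # sustained disk write rate (illustrative)
buffer_gb = 20           # memory buffer I can allocate

fill_rate_mb_s = input_rate_mb_s - write_rate_mb_s   # 100 MB/s
max_duration_s = buffer_gb * 1000 / fill_rate_mb_s   # ~200 s
print(f'buffer full after ~{max_duration_s:.0f} s')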

henrypinkard commented 3 years ago

Good to know. Yeah, this certainly should be doable. I'll test this out on my next pass through the code.

dpshepherd commented 3 years ago

A quick note. Using the code I gave you and the new pycromanager/associated libraries, the demo config works fine, but our real hardware doesn't complete the acquisition. It seems to acquire all of the images, but the Acquisition doesn't return from the with block.

Using the previous version of pycromanager/associated libraries, the same code completes on our real hardware. The real hardware is set up to generate ~300 MB/s right now.

I need to do some more testing and generate log files, but wanted to drop a note on this observation.

henrypinkard commented 3 years ago

Just pushed a new version (https://github.com/micro-manager/micro-manager/pull/1100) with updated libraries

This should have better logging, and the correct behavior of automatically aborting the acquisition and showing an error when the sequence buffer overflows. Right now the error shows in the GUI, not the Python console. Still thinking about how best to propagate these across the Java-Python bridge.

dpshepherd commented 3 years ago

I'm working on our setup this morning and just placed the newest libraries in a separate install of MM nightly build 20210205.

I still see the same behavior with our actual hardware. The acquisition goes along fine, writing at the correct speed with no images in the circular buffer. Then, at a seemingly random time, the circular buffer gets some images in it (but does not fill), and the entire acquisition halts.

The same code with our actual hardware finishes with the previous major version of pycromanager/associated libraries (before NDTiffStorage 2.0).

I will repeat with the logging commands turned on and upload the results.

dpshepherd commented 3 years ago

Here is the core log.

I let the system sit for 15 minutes or so after it froze up; nothing further happened. Same behavior as before: everything goes along fine, then the circular buffer starts to fill and the acquisition freezes.

I was able to regain control of the microscope hardware through the GUI; those are the calls in the log after the writing just stops. The freeze happened less than 10% of the way into the total images for the run.

CoreLog20210206T120635_pid12068.zip

henrypinkard commented 3 years ago

There doesn't seem to be any issue with writing speed based on that core log. These are the write times for each image:

[screenshot: plot of per-image write times]
henrypinkard commented 3 years ago

But I do see that the writing queue fills up at the end there, which suggests that the Tiff writing code is stuck somewhere.

henrypinkard commented 3 years ago

I just added a bunch more logging to try to see what line it's getting hung up on. Try NDTiff 2.1.1.

dpshepherd commented 3 years ago

Ok, I think I now know what is going on. Upon switching to NDTiff 2.1.1, I started getting this error in the GUI, which I hadn't seen before.

[screenshot: error message shown in the Micro-Manager GUI]

We are demoing a different company's camera this week. It turns out that the MM driver works a bit differently than the API documentation suggests. Long story short, I am now 99% sure that the above error is due to the camera not being in the proper mode to receive a trigger from the external controller. This led the camera to start sending frames without waiting for a signal from the controller, which created a race condition and eventually the error. I'm not sure why the older pycromanager/associated libraries did not throw an error, but it was super helpful that the new ones did!

All that said, a new issue has popped up: the with block for the Acquisition does not return upon completion. In the Python debug output, I see the commands:

DEBUG, recieved: {'special': 'acquisition-end'}
DEBUG, sending: {}
DEBUG, recieved: {'special': 'acquisition-end'}
DEBUG, sending: {}

but the acquisition does not return. You can see where I ctrl-c out of the code at the end of the Python console output.

I have attached the Python console output and MM core log from a run that seems to have collected all the data correctly but did not clean up. This was run at a data generation rate (~600 MB/s) that was causing problems with the old NDTiff. It seems to have run without breaking a sweat, so that is awesome!

Python console: debug_log3.zip MM core log: CoreLog20210207T154352_pid17024 - Copy.zip

henrypinkard commented 3 years ago

> I'm not sure why the older pycromanager/associated libraries did not throw an error, but it was super helpful that the new ones did!

Yeah, I discovered in the last update that the acq engine was swallowing exceptions from the camera, so I added that logging in. It still doesn't make sense to me why that would cause the writing queue to fill up, but maybe that was either a coincidence or just some weird consequence of the race condition.

> This was run at a data generation rate (~600 MB/s) that was causing problems with the old NDTiff. It seems to have run without breaking a sweat, so that is awesome!

Great! It seems strange that the core log says it's only writing at 200 MB/s though... Can you check that all the data is actually there?

The Python console file appears blank to me for some reason, but I cleaned up some code that looked like it could potentially be problematic and added more logging to the finishing code in v2.1.2. Python seems to exit the with block fine for me on the demo config. Maybe it takes a bit to finish writing everything on your large runs?

dpshepherd commented 3 years ago

> Great! It seems strange that the core log says it's only writing at 200 MB/s though... Can you check that all the data is actually there?

I don't understand the writing speed, but the data does seem to be all there.

> The Python console file appears blank to me for some reason, but I cleaned up some code that looked like it could potentially be problematic and added more logging to the finishing code in v2.1.2. Python seems to exit the with block fine for me on the demo config. Maybe it takes a bit to finish writing everything on your large runs?

NDTiff v2.1.2 did not help; the acquisition is still hung up. I let it sit for 30 minutes and the hard drive never showed any activity after the images stopped. This is for a ~300 MB/s run, which again doesn't show up at that rate in the core log.

Here is the Python console output when I ctrl-c out:

Traceback (most recent call last):

  File "run_opm_iterative_stagescan.py", line 516, in <module>
    main()
  File "run_opm_iterative_stagescan.py", line 485, in main
    acq.acquire(events)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\acquire.py", line 408, in __exit__
    self.await_completion()
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\acquire.py", line 425, in await_completion
    time.sleep(0.1)
KeyboardInterrupt
Exception ignored in: <function JavaObjectShadow.__del__ at 0x000001DF0FFB5310>
Traceback (most recent call last):
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 470, in __del__
    self._close()
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 460, in _close
    self._socket.send(message)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 115, in send
    self._socket.send_multipart(message_parts)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\zmq\sugar\socket.py", line 555, in send_multipart
    return self.send(msg_parts[-1], flags, copy=copy, track=track)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\zmq\sugar\socket.py", line 505, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq\backend\cython\socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq\backend\cython\socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq\backend\cython\socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Unknown error
Exception ignored in: <function JavaObjectShadow.__del__ at 0x000001DF0FFB5310>
Traceback (most recent call last):
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 470, in __del__
    self._close()
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 460, in _close
    self._socket.send(message)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\pycromanager\core.py", line 115, in send
    self._socket.send_multipart(message_parts)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\zmq\sugar\socket.py", line 555, in send_multipart
    return self.send(msg_parts[-1], flags, copy=copy, track=track)
  File "C:\Users\dpsheph1\Anaconda3\envs\new_pycro\lib\site-packages\zmq\sugar\socket.py", line 505, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq\backend\cython\socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq\backend\cython\socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq\backend\cython\socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Unknown error

The last two errors are the ones we always see with the two hooks on Windows.

Here is the console output (I checked and can see the output in the .txt file): debug_log5.zip

Here is the MM core log: CoreLog20210208T111558_pid7572 - Copy.zip

dpshepherd commented 3 years ago

Ok, here is an interesting observation. The size of the TIFFs on the disk is pretty close to what it should be for the expected number of images.

size of NDtiff dataset TIFF files on disk / (number of images * size of one image) = 1.0008
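(Computed with something like the following; the path, ROI, and image count are placeholders.)

from pathlib import Path

data_dir = Path('/path/to/data')                 # placeholder
bytes_per_image = 2048 * 2048 * 2                # placeholder ROI, 16-bit pixels
n_images_expected = 100_000                      # placeholder

bytes_on_disk = sum(f.stat().st_size for f in data_dir.rglob('*.tif'))
print(bytes_on_disk / (n_images_expected * bytes_per_image))  # ~1.0 if complete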

However, when I try to read in the data using

from pycromanager import Dataset
dataset = Dataset('/path/to/data')

it prints Reading index...

and doesn't return to the console, even after waiting for 5-10 minutes.

Maybe some frames or index entries are being dropped somewhere? The circular buffer doesn't have images in it, and the RAM usage on the computer never goes up.

dpshepherd commented 3 years ago

MM nightly build 20210201 & the previous version of pycromanager/associated libraries successfully run the same code and load the data into a Dask array. All images appear to be present.
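For reference, the loading pattern is just (a sketch; as_array per the pycromanager docs):

from pycromanager import Dataset

dataset = Dataset('/path/to/data')   # placeholder path
darr = dataset.as_array()            # lazy Dask array backed by the NDTiff files
print(darr.shape)
first = darr[0].compute()            # materializes only that slice from disk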

henrypinkard commented 3 years ago

Okay, I see from that error that on the Python side the with block is just waiting for the Java acquisition to finish, which makes sense because it seems the finishing code never actually gets to run. Is your index file (inside the full resolution dir) greater than 20 MB? That would suggest it never actually finished properly (which would also explain the opening problems).
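e.g. a quick way to check (path is a placeholder):

from pathlib import Path

index_file = Path('/path/to/data/Full resolution/NDTiff.index')  # placeholder
print(f'{index_file.stat().st_size / 1e6:.1f} MB')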