labscript-suite / labscript-devices

A modular and extensible plugin architecture to control experiment hardware using the 𝘭𝘢𝘣𝘴𝘤𝘳𝘪𝘱𝘵 𝘴𝘶𝘪𝘵𝘦.
http://labscriptsuite.org
Other
5 stars 58 forks source link

BLACS hang with continuous camera acquisitions #36

Closed philipstarkey closed 4 years ago

philipstarkey commented 5 years ago

Original report (archived issue) by David Meyer (Bitbucket: dihm, GitHub: dihm).


I again ran in to an issue I first saw working on PR #69 where fast continuous acquisitions with the PylonCamera would cause a complete lock up of the BLACS main process. With further testing and I can now reproduce the issue using the Mock Camera applying the following diff to the IMAQdxCamera blacs_workers.py

@@ -95,5 +95,5 @@
         print(f"Got {len(images)} of {n_images} (mock) images.")

     def snap(self):
-        N = 500
+        N = 100
         A = 500
@@ -99,6 +99,6 @@
         A = 500
-        x = np.linspace(-5, 5, 500)
+        x = np.linspace(-5, 5, N)
         y = x.reshape((N, 1))
         clean_image = A * (1 - 0.5 * np.exp(-(x ** 2 + y ** 2)))

         # Write text on the image that says "NOT REAL DATA"
@@ -101,6 +101,6 @@
         y = x.reshape((N, 1))
         clean_image = A * (1 - 0.5 * np.exp(-(x ** 2 + y ** 2)))

         # Write text on the image that says "NOT REAL DATA"
-        from PIL import Image, ImageDraw, ImageFont
+        #from PIL import Image, ImageDraw, ImageFont

@@ -106,9 +106,9 @@

-        font = ImageFont.load_default()
-        canvas = Image.new('L', [N // 5, N // 5], (0,))
-        draw = ImageDraw.Draw(canvas)
-        draw.text((10, 20), "NOT REAL DATA", font=font, fill=1)
-        clean_image += 0.2 * A * np.asarray(canvas.resize((N, N)).rotate(20))
+        #font = ImageFont.load_default()
+        #canvas = Image.new('L', [N // 5, N // 5], (0,))
+        #draw = ImageDraw.Draw(canvas)
+        #draw.text((10, 20), "NOT REAL DATA", font=font, fill=1)
+        #clean_image += 0.2 * A * np.asarray(canvas.resize((N, N)).rotate(20))
         return np.random.poisson(clean_image)

     def stop_acquisition(self):

Basically, you need to get the fps way up to see the issue, which occurs around iteration #330 on my machine. The PylonCamera shows the same behavior, but it occurs at lower fps (25 fps) and sooner (~130). It does not appear to occur if the randomizer is turned off on the mock camera images (ie return clean_image only).

Best guess at the moment is that the histogram updates of pyqtgraph are causing issues. Is it possible that those updates are non-blocking and can stack up somehow? That doesn’t seem likely but I’m not sure what else it could be at this point.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Thanks for working out how to reproduce the issue! I'll see if I can get it to happen on one of my machines - once I can get it to lock up I should be able to fairly easily work out what the culprit is via using the faulthandler to dump tracebacks of all threads, or something like that.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


I was not able to reproduce this, with frame rates close to 180fps. I'll try again on a different computer later.

However, I did find a deadlock if one clicks 'restart' for the tab while acquiring at high frame rates. So I have fixed that (PR #77). It doesn't seem like the same issue though, even though this one sounds like a deadlock of some sort too.

If you are able to reliably reproduce the problem, then I can tell you what I would be doing if I could reproduce it, and maybe you can show me the resulting tracebacks and we might be able to figure out what's going on.

I would add at the top of blacs/__main__.py:

import faulthandler
f = open(r'C:\labscript_suite\blacs\faulthandler.log', 'a')
f.write('\n\n---restart---\n\n')
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=f)

Then, once the hang occurs, wait a full 60 seconds (or wait less if you decrease the timeout for dumping tracebacks), then quit BLACS. I don't know if it will flush the output file. If it doesn't and Python exits in an unclean way, the output file may not contain the final set of tracebacks. In which case I would switch it to file=sys.stderr and rely on reading it from the terminal, but you may need to increase the number of scrollback lines in the terminal settings, otherwise you won't be able to see all the tracebacks - it outputs one traceback per thread, and there are likely to be tens of threads, depending on how many devices you have.

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


OK, so I've made some progress. I have a traceback, which I'll just e-mail to you directly since it is pretty long.

More to the point, I've discovered another important step to reproduce the bug. It seems it only happens if I first load BLACS, change something in the worker, save, then restart the tab. Could it be something with the cache updating?

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


We are hitting this issue again in the lab so I've taken some more time to characterize what is happening with added logging of some of the relevant players.

For starters, whatever is going on seems to be just intermittent enough that I can't reliably pin down exactly what steps are needed to reproduce in all cases. I can attest that we will often hit a long term steady-state where this is consistently an issue, particularly with the PylonCamera device, when the fps gets above ~25. For the logs below, I started BLACS, took a snap, started continuous, waited a couple seconds until the fps stabilized, then hit stop. After a few seconds Windows would recognize the GUI as not responding. In this instance, images appear to update correctly until I hit stop.

In order to better debug, I've added logging messages into IMAQdxCameraWorker.continuous_loop() and a bailout check where once 1000 images have been taken it adjusts dt down to 1/15 which I've found breaks the deadlock and allows me to get proper control back. I've also added logging messages to ImageReceiver.handler() in the IMAQdx blacs tab:

logger.info('ImageReceiver Processing')
if self.update_event is not None:
        while True:
            # Don't block, and check for self.stopping regularly in case we
are
            # shutting down. Otherwise if shutdown is called from the main
thread we
            # would deadlock.
            try:
                self.update_event.get(timeout=0.1)
                logger.info('Update event cleared')
                break
            except Empty:
                if self.stopping:
                    return
                logger.info('  Waiting for previous update!')
    self.update_event = inmain_later(self.update, image, self.frame_rate)
    return [b'ok']

With these changes, I now have three logs: the normal BLACS log, one following the ImageReceiver of the BLACS tab, and one following the BLACS worker. Relevant snippets are below. It appears that the BLACS.__main__ loop is getting hung up but the worker is not getting blocked to slow down the flood of images. This particular snippet of BLACS log is unusual in that there is some activity between the continuous start and the main loop hang, probably because I got lucky and started the acquisition just as the check_status jobs went out. Note that the SMF100A has a statemachine_timeout of 5 seconds. There is an NI_DAQ and two other cameras connected that don't appear in this snippet. The hang is seen when only the AnalysisSubmission loop is logging events and lasts for about a minute.

Checking the task manager on this, the blacs worker thread barely changes but the __main__ thread explodes, eating 50% of the processor time and the memory footprint ballooning from 170MB to nearly 500MB until the continuous loop is brought to heel.

At this point I'm a bit stuck as to what to do next. The next step in the chain is qtutils.inmain_later, which I don't understand very well. Unless someone has another thought, could I get a brief explanation of it and/or a pointer to where my next level of debug logging needs to be to track down this hang?

BLACS.log

2019-09-06 10:00:40,681 INFO BLACS.AnalysisSubmission.mainloop: Processed signal: check/retry
2019-09-06 10:00:40,704 DEBUG BLACS.CCD_2.mainloop: Processing event _start_continuous
2019-09-06 10:00:40,708 DEBUG BLACS.CCD_2.mainloop: Instructing worker main_worker to do job start_continuous
2019-09-06 10:00:40,713 DEBUG BLACS.CCD_2.mainloop: Waiting for worker to acknowledge job request
2019-09-06 10:00:40,714 DEBUG BLACS.CCD_2.mainloop: Worker reported job started, waiting for completion
2019-09-06 10:00:40,742 DEBUG BLACS.CCD_2.mainloop: Job completed
2019-09-06 10:00:40,743 DEBUG BLACS.CCD_2.mainloop: returning worker results to function _start_continuous
2019-09-06 10:00:40,748 DEBUG BLACS.CCD_2.mainloop: Finalising function
2019-09-06 10:00:40,751 DEBUG BLACS.CCD_2.mainloop: Waiting for next event
2019-09-06 10:00:40,772 DEBUG BLACS.pulseblaster_0.mainloop: Processing event _status_monitor
2019-09-06 10:00:40,776 DEBUG BLACS.pulseblaster_0.mainloop: Instructing worker main_worker to do job check_status
2019-09-06 10:00:40,777 DEBUG BLACS.pulseblaster_0_main_worker.worker: Got job request check_status
2019-09-06 10:00:40,777 DEBUG BLACS.pulseblaster_0_main_worker.worker: Starting job check_status
2019-09-06 10:00:40,778 DEBUG BLACS.pulseblaster_0_main_worker.worker: Job complete
2019-09-06 10:00:40,778 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for worker to acknowledge job request
2019-09-06 10:00:40,778 DEBUG BLACS.pulseblaster_0_main_worker.worker: Waiting for next job request
2019-09-06 10:00:40,781 DEBUG BLACS.pulseblaster_0.mainloop: Worker reported job started, waiting for completion
2019-09-06 10:00:40,782 DEBUG BLACS.pulseblaster_0.mainloop: Job completed
2019-09-06 10:00:40,783 DEBUG BLACS.pulseblaster_0.mainloop: returning worker results to function _status_monitor
2019-09-06 10:00:40,830 DEBUG BLACS.pulseblaster_0.mainloop: Finalising function
2019-09-06 10:00:40,872 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for next event
2019-09-06 10:00:40,960 DEBUG BLACS.pulseblaster_0.mainloop: Processing event __timeout_add
2019-09-06 10:00:41,146 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for next event
2019-09-06 10:00:47,502 DEBUG BLACS.SMF100A.mainloop: Processing event _status_monitor
2019-09-06 10:00:47,506 DEBUG BLACS.pulseblaster_0.mainloop: Processing event _status_monitor
2019-09-06 10:00:47,662 DEBUG BLACS.SMF100A.mainloop: Instructing worker main_worker to do job check_status
2019-09-06 10:00:47,665 DEBUG BLACS.SMF100A_main_worker.worker: Got job request check_status
2019-09-06 10:00:47,666 DEBUG BLACS.SMF100A_main_worker.worker: Starting job check_status
2019-09-06 10:00:47,665 DEBUG BLACS.pulseblaster_0.mainloop: Instructing worker main_worker to do job check_status
2019-09-06 10:00:47,667 DEBUG BLACS.SMF100A_main_worker.worker: Job complete
2019-09-06 10:00:47,668 DEBUG BLACS.SMF100A_main_worker.worker: Waiting for next job request
2019-09-06 10:00:47,668 DEBUG BLACS.pulseblaster_0_main_worker.worker: Got job request check_status
2019-09-06 10:00:47,669 DEBUG BLACS.pulseblaster_0_main_worker.worker: Starting job check_status
2019-09-06 10:00:47,669 DEBUG BLACS.pulseblaster_0_main_worker.worker: Job complete
2019-09-06 10:00:47,669 DEBUG BLACS.pulseblaster_0_main_worker.worker: Waiting for next job request
2019-09-06 10:00:47,742 DEBUG BLACS.SMF100A.mainloop: Waiting for worker to acknowledge job request
2019-09-06 10:00:47,745 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for worker to acknowledge job request
2019-09-06 10:00:47,747 DEBUG BLACS.SMF100A.mainloop: Worker reported job started, waiting for completion
2019-09-06 10:00:47,750 DEBUG BLACS.pulseblaster_0.mainloop: Worker reported job started, waiting for completion
2019-09-06 10:00:47,752 DEBUG BLACS.SMF100A.mainloop: Job completed
2019-09-06 10:00:47,753 DEBUG BLACS.pulseblaster_0.mainloop: Job completed
2019-09-06 10:00:47,753 DEBUG BLACS.SMF100A.mainloop: returning worker results to function _status_monitor
2019-09-06 10:00:47,753 DEBUG BLACS.pulseblaster_0.mainloop: returning worker results to function _status_monitor
2019-09-06 10:00:47,878 DEBUG BLACS.SMF100A.mainloop: Finalising function
2019-09-06 10:00:47,881 DEBUG BLACS.pulseblaster_0.mainloop: Finalising function
2019-09-06 10:00:47,964 DEBUG BLACS.SMF100A.mainloop: Waiting for next event
2019-09-06 10:00:47,966 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for next event
2019-09-06 10:00:48,128 DEBUG BLACS.SMF100A.mainloop: Processing event __timeout_add
2019-09-06 10:00:48,129 DEBUG BLACS.pulseblaster_0.mainloop: Processing event __timeout_add
2019-09-06 10:00:48,337 DEBUG BLACS.SMF100A.mainloop: Waiting for next event
2019-09-06 10:00:48,339 DEBUG BLACS.pulseblaster_0.mainloop: Waiting for next event
2019-09-06 10:00:50,838 INFO BLACS.AnalysisSubmission.mainloop: Processed signal: check/retry
2019-09-06 10:01:00,994 INFO BLACS.AnalysisSubmission.mainloop: Processed signal: check/retry
2019-09-06 10:01:11,153 INFO BLACS.AnalysisSubmission.mainloop: Processed signal: check/retry
2019-09-06 10:01:19,017 DEBUG BLACS.CCD_2.mainloop: Processing event _stop_continuous
2019-09-06 10:01:19,017 DEBUG BLACS.pulseblaster_0.mainloop: Processing event _status_monitor
2019-09-06 10:01:19,017 DEBUG BLACS.SMF100A.mainloop: Processing event _status_monitor
2019-09-06 10:01:19,017 DEBUG BLACS.novatech.mainloop: Processing event _check_remote_values
2019-09-06 10:01:19,018 DEBUG BLACS.novatech_static.mainloop: Processing event _check_remote_values
2019-09-06 10:01:19,088 DEBUG BLACS.CCD_2.mainloop: Instructing worker main_worker to do job stop_continuous

BLACS_tab log

2019-09-06 10:00:40,780 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:00:40,781 INFO Camera_Tab: Update event cleared
2019-09-06 10:00:40,812 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:00:40,812 INFO Camera_Tab: Update event cleared
2019-09-06 10:00:40,843 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:00:40,843 INFO Camera_Tab: Update event cleared
2019-09-06 10:00:40,874 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:00:40,875 INFO Camera_Tab: Update event cleared
<lots of repeats without any gaps in time>
2019-09-06 10:01:00,137 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:00,161 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:00,175 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:00,199 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:00,213 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:00,236 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:00,250 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:00,274 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:00,287 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:00,310 INFO Camera_Tab: Update event cleared
<more repeats without gaps>
2019-09-06 10:01:18,937 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:18,940 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:19,003 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:19,004 INFO Camera_Tab: Update event cleared
2019-09-06 10:01:19,081 INFO Camera_Tab: ImageReceiver Processing
2019-09-06 10:01:19,082 INFO Camera_Tab: Update event cleared

BLACS_worker log

2019-09-06 10:00:32,921 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:00:32,922 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request
2019-09-06 10:00:32,958 DEBUG Camera_Worker_PylonCameraWorker: Got job request program_manual
2019-09-06 10:00:32,958 DEBUG Camera_Worker_PylonCameraWorker: Starting job program_manual
2019-09-06 10:00:32,959 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:00:32,959 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request
2019-09-06 10:00:39,562 DEBUG Camera_Worker_PylonCameraWorker: Got job request snap
2019-09-06 10:00:39,563 DEBUG Camera_Worker_PylonCameraWorker: Starting job snap
2019-09-06 10:00:39,719 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:00:39,719 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request
2019-09-06 10:00:40,709 DEBUG Camera_Worker_PylonCameraWorker: Got job request start_continuous
2019-09-06 10:00:40,709 DEBUG Camera_Worker_PylonCameraWorker: Starting job start_continuous
2019-09-06 10:00:40,728 INFO Camera_Worker_PylonCameraWorker: Grabbing 0th image
2019-09-06 10:00:40,732 INFO Camera_Worker_PylonCameraWorker: Starting Continuous
2019-09-06 10:00:40,741 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:00:40,741 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request
2019-09-06 10:00:40,770 INFO Camera_Worker_PylonCameraWorker: Sending 0th image to parent
2019-09-06 10:00:40,783 INFO Camera_Worker_PylonCameraWorker: Grabbing 1th image
2019-09-06 10:00:40,801 INFO Camera_Worker_PylonCameraWorker: Sending 1th image to parent
2019-09-06 10:00:40,814 INFO Camera_Worker_PylonCameraWorker: Grabbing 2th image
2019-09-06 10:00:40,833 INFO Camera_Worker_PylonCameraWorker: Sending 2th image to parent
2019-09-06 10:00:40,844 INFO Camera_Worker_PylonCameraWorker: Grabbing 3th image
2019-09-06 10:00:40,864 INFO Camera_Worker_PylonCameraWorker: Sending 3th image to parent
2019-09-06 10:00:40,878 INFO Camera_Worker_PylonCameraWorker: Grabbing 4th image
2019-09-06 10:00:40,896 INFO Camera_Worker_PylonCameraWorker: Sending 4th image to parent
<lot's of lines without any gaps in time>
2019-09-06 10:01:00,124 INFO Camera_Worker_PylonCameraWorker: Grabbing 507th image
2019-09-06 10:01:00,126 INFO Camera_Worker_PylonCameraWorker: Sending 507th image to parent
2019-09-06 10:01:00,162 INFO Camera_Worker_PylonCameraWorker: Grabbing 508th image
2019-09-06 10:01:00,164 INFO Camera_Worker_PylonCameraWorker: Sending 508th image to parent
2019-09-06 10:01:00,201 INFO Camera_Worker_PylonCameraWorker: Grabbing 509th image
2019-09-06 10:01:00,202 INFO Camera_Worker_PylonCameraWorker: Sending 509th image to parent
2019-09-06 10:01:00,238 INFO Camera_Worker_PylonCameraWorker: Grabbing 510th image
2019-09-06 10:01:00,240 INFO Camera_Worker_PylonCameraWorker: Sending 510th image to parent
2019-09-06 10:01:00,275 INFO Camera_Worker_PylonCameraWorker: Grabbing 511th image
2019-09-06 10:01:00,277 INFO Camera_Worker_PylonCameraWorker: Sending 511th image to parent
<more lines without gaps>
2019-09-06 10:01:18,670 INFO Camera_Worker_PylonCameraWorker: Grabbing 994th image
2019-09-06 10:01:18,672 INFO Camera_Worker_PylonCameraWorker: Sending 994th image to parent
2019-09-06 10:01:18,708 INFO Camera_Worker_PylonCameraWorker: Grabbing 995th image
2019-09-06 10:01:18,710 INFO Camera_Worker_PylonCameraWorker: Sending 995th image to parent
2019-09-06 10:01:18,746 INFO Camera_Worker_PylonCameraWorker: Grabbing 996th image
2019-09-06 10:01:18,748 INFO Camera_Worker_PylonCameraWorker: Sending 996th image to parent
2019-09-06 10:01:18,784 INFO Camera_Worker_PylonCameraWorker: Grabbing 997th image
2019-09-06 10:01:18,785 INFO Camera_Worker_PylonCameraWorker: Sending 997th image to parent
2019-09-06 10:01:18,820 INFO Camera_Worker_PylonCameraWorker: Grabbing 998th image
2019-09-06 10:01:18,822 INFO Camera_Worker_PylonCameraWorker: Sending 998th image to parent
2019-09-06 10:01:18,857 INFO Camera_Worker_PylonCameraWorker: Grabbing 999th image
2019-09-06 10:01:18,859 INFO Camera_Worker_PylonCameraWorker: Sending 999th image to parent
2019-09-06 10:01:18,894 ERROR Camera_Worker_PylonCameraWorker: Adjusting dt from 0.000000 to 0.066667
2019-09-06 10:01:18,924 INFO Camera_Worker_PylonCameraWorker: Grabbing 1000th image
2019-09-06 10:01:18,926 INFO Camera_Worker_PylonCameraWorker: Sending 1000th image to parent
2019-09-06 10:01:18,990 INFO Camera_Worker_PylonCameraWorker: Grabbing 1001th image
2019-09-06 10:01:18,992 INFO Camera_Worker_PylonCameraWorker: Sending 1001th image to parent
2019-09-06 10:01:19,069 INFO Camera_Worker_PylonCameraWorker: Grabbing 1002th image
2019-09-06 10:01:19,071 INFO Camera_Worker_PylonCameraWorker: Sending 1002th image to parent
2019-09-06 10:01:19,090 DEBUG Camera_Worker_PylonCameraWorker: Got job request stop_continuous
2019-09-06 10:01:19,090 DEBUG Camera_Worker_PylonCameraWorker: Starting job stop_continuous
2019-09-06 10:01:19,090 INFO Camera_Worker_PylonCameraWorker: Stopping Continuous
2019-09-06 10:01:19,134 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:01:19,134 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request
2019-09-06 10:01:26,250 DEBUG Camera_Worker_PylonCameraWorker: Got job request shutdown
2019-09-06 10:01:26,250 DEBUG Camera_Worker_PylonCameraWorker: Starting job shutdown
2019-09-06 10:01:26,251 DEBUG Camera_Worker_PylonCameraWorker: Job complete
2019-09-06 10:01:26,251 DEBUG Camera_Worker_PylonCameraWorker: Waiting for next job request

Edit: fixed wierdness in formatting and truncation due to replying by e-mail

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Sorry again for not seeing this, I think you must be correct that if you use the email response feature, bitbucket sends no notifications. Oh well, we'll be on github soon enough :p.

These results give some leads!

inmain_later queues up a function call to occur in the main thread, and returns a queue.Queue object that the result of the function call (or an exception) will be put to once the function in the main thread runs. It does this by posting a custom Qt event (qtutils.invoke_in_main.CallEvent) that holds the function, its args, and the Queue that the response should be posted to. A custom event handler (qtutils.invoke_in_main.Caller) runs in the main thread and handles the event by calling the function and putting the result to the Queue. So each call to self.update_event.get(timeout=0.1) is waiting for the previously requested function to run in the main thread, and this is supposed to prevent a backlog. If we just called inmain_later without waiting for the result at any point, we definitely would get a backlog there.

The fact that your logging keeps saying "Update event cleared" implies that the update() function is running and looks like it keeping up. This makes me think the backlog is downstream from that, for example, perhaps image_view.setImage() queues up something to happen later, and is not fully synchronous.

To confirm this, you could:

diff -r 4bc2332cf63848b0a8abd81b8bfbab72e5681f5a IMAQdxCamera/blacs_tabs.py
--- a/IMAQdxCamera/blacs_tabs.py        Mon Aug 26 16:13:28 2019 -0400
+++ b/IMAQdxCamera/blacs_tabs.py        Sat Sep 14 16:37:52 2019 -0400
@@ -22,7 +22,7 @@

 import numpy as np

-from qtutils import UiLoader, inmain_later
+from qtutils import UiLoader, inmain_later, inmain
 import qtutils.icons
 from qtutils.qt import QtWidgets, QtGui, QtCore
 import pyqtgraph as pg
@@ -71,19 +71,7 @@
             else:
                 self.frame_rate = 1 / dt
         self.last_frame_time = this_frame_time
-        # Wait for the previous update to compete so we don't accumulate a backlog:
-        if self.update_event is not None:
-            while True:
-                # Don't block, and check for self.stopping regularly in case we are
-                # shutting down. Otherwise if shutdown is called from the main thread we
-                # would deadlock.
-                try:
-                    self.update_event.get(timeout=0.1)
-                    break
-                except Empty:
-                    if self.stopping:
-                        return
-        self.update_event = inmain_later(self.update, image, self.frame_rate)
+        inmain(self.update, image, self.frame_rate)
         return [b'ok']

     def update(self, image, frame_rate):

This will wait for the update() function to run in the main thread before continuing. If the issue is downstream in the pyqtgraph method calls, then it might still occur. However, this change may decrease the attainable frame rate since it delays sending 'ok' back to the sender, so it might make the problem go away solely due to the decreased frame rate and not actually fixing any underlying problem.

Another thing that might hint at where the backlog is is turning on our Python memory profiler, which saves lists of how many of each type of object there is in the interpreter. It can be used as follows:

import labscript_utils.memprof as memprof
# Do this once to get a baseline of number of objects of each
# type - ideally this should be called during normal usage, i.e. not at
# BLACS startup, but before memory has ballooned. Maybe have it
# called the first time ImageReceiver.handler() runs for example.
memprof.start(filepath='<some_filepath>')
# Arrange to call this when the memory has ballooned,
# for example after 1000 images if that is a reliable
# number for reproducing the bug:
memprof.check()

The result will then be a file at containing the increase in the number of each type of object in the interpreter between the start() and check() calls.

There is also the BLACS memory plugin that adds menu entries to BLACS that will call memprof's start and check functions, but since the GUI won't be functioning at that point I think the plugin won't be much use!

If the objects in memfprof's output file are all qtutils CallEvents and numpy arrays of image data, that would be evidence that our posting of events to the main thread might be the culprit (though I think the evidence is mostly against that explanation). If the objects are all things from pyqtgraph, then perhaps pyqtgraph is delaying updates of its own and a backlog is accumulating there.

I think when I tried to reproduce the bug I was not running the latest pyqtgraph from github, whereas I think you were. Maybe something has changed and I should have another go at reproducing the problem.

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


OK, scratched out some time to look at this again. Note that I've updated labscript_devices to include the most recent IMAQdxCamera updates. And while I have installed pyqtgraph from github recently, I apparently went back to the conda version 0.10 at some point so all these following tests are with the old version.

Results:

I also took a look with memprof. I did this by initializing at the end of the first call to ImageReciever.handler() then calling check at different iterations of the loop (1,200,500). I believe both the 200 and 500 checks were done while the GUI was completely hung (ie windows recognizes BLACS at not responding). Even the first loop creates quite a bit of noise, but compared with the 200th loop there are a few standouts I can see. First, there are around 40 extra 'managedBuffer' and 'memoryview' class instances. Second, the number of tuples, lists and dicts seems to fluctuate pretty wildly depending on when the check happens. Next, the only meaningful count for pyqtgraph objects is 'pygtgraph.Point.Point'. Finally, while there are plenty of QT5 and SIP objects, they seem pretty consistent no matter which loop I'm looking at.

I also have a memprof when using inmain instead of inmain_later at 100. It is similar to the other loops, though the managedBuffer and memoryview counts are down to 17. That could be coincidental though.

I’m probably going to start looking at pyqtgraph imageSet and see if something jumps out at me. As for reproducing the problem with the Mock camera, you may need to increase the pixel count from 500x500 as well as speeding up the loop. My full frame grabs are 1280x960x16bit at the moment and I suspect the image size may be contributing to the issue.

Here is the top memprof contributors from the first loop:

                              <class 'sip.methoddescriptor'>     8716
                     <class 'PyQt5.QtCore.QLocale.Language'>      375
                      <class 'PyQt5.QtCore.QLocale.Country'>      269
                                          <class 'function'>      232
                                             <class 'tuple'>      230
                            <class 'sip.variabledescriptor'>      207
                                              <class 'dict'>      169
                       <class 'PyQt5.QtCore.QLocale.Script'>      144
                  <class 'PyQt5.QtGui.QPageSize.PageSizeId'>      127
          <class 'PyQt5.QtGui.QOpenGLTexture.TextureFormat'>      126
            <class 'PyQt5.QtGui.QPagedPaintDevice.PageSize'>      126
                                           <class 'weakref'>       89
                       <class 'PyQt5.QtCore.QMetaType.Type'>       78
              <class 'PyQt5.QtGui.QKeySequence.StandardKey'>       71
                        <class 'PyQt5.QtCore.QVariant.Type'>       59
                                 <class 'getset_descriptor'>       58
                    <class 'PyQt5.QtCore.QEasingCurve.Type'>       48
                                              <class 'type'>       29
                                              <class 'cell'>       29
                  <class 'PyQt5.QtCore.QSysInfo.WinVersion'>       29
        <class 'PyQt5.QtWidgets.QMessageBox.StandardButton'>       27
                  <class 'PyQt5.QtCore.QDataStream.Version'>       26
              <class 'PyQt5.QtGui.QOpenGLTexture.PixelType'>       26
                   <class 'PyQt5.QtOpenGL.QGL.FormatOption'>       22
  <class 'PyQt5.QtWidgets.QScrollerProperties.ScrollMetric'>       21
        <class 'PyQt5.QtOpenGL.QGLFormat.OpenGLVersionFlag'>       21
                          <class 'PyQt5.QtCore.QDir.Filter'>       20
      <class 'PyQt5.QtCore.QStandardPaths.StandardLocation'>       20
       <class 'PyQt5.QtGui.QPaintEngine.PaintEngineFeature'>       20
                     <class 'PyQt5.QtGui.QPaintEngine.Type'>       20
<class 'PyQt5.QtWidgets.QTreeWidgetItemIterator.IteratorFlag'>       20
           <class 'PyQt5.QtGui.QImageIOHandler.ImageOption'>       19
            <class 'PyQt5.QtGui.QOpenGLTexture.PixelFormat'>       19
   <class 'PyQt5.QtWidgets.QDialogButtonBox.StandardButton'>       19
              <class 'PyQt5.QtWidgets.QWizard.WizardOption'>       17
                <class 'PyQt5.QtGui.QOpenGLTexture.Feature'>       16
                <class 'PyQt5.QtGui.QPixelFormat.YUVLayout'>       16
         <class 'PyQt5.QtCore.QCryptographicHash.Algorithm'>       15
                <class 'PyQt5.QtCore.QFileDevice.FileError'>       15
           <class 'PyQt5.QtCore.QJsonParseError.ParseError'>       15
         <class 'PyQt5.QtCore.QLibraryInfo.LibraryLocation'>       15
                <class 'PyQt5.QtGui.QPaintEngine.DirtyFlag'>       14
             <class 'PyQt5.QtCore.QUrl.UrlFormattingOption'>       13
                        <class 'PyQt5.QtCore.QDir.SortFlag'>       12
               <class 'PyQt5.QtCore.QFileDevice.Permission'>       12
           <class 'PyQt5.QtCore.QXmlStreamReader.TokenType'>       11
              <class 'PyQt5.QtGui.QOpenGLDebugMessage.Type'>       11
          <class 'PyQt5.QtGui.QOpenGLTexture.BindingTarget'>       11
                 <class 'PyQt5.QtGui.QOpenGLTexture.Target'>       11
          <class 'PyQt5.QtGui.QTextFrameFormat.BorderStyle'>       11
             <class 'PyQt5.QtWidgets.QDateTimeEdit.Section'>       11
     <class 'PyQt5.QtCore.QRegularExpression.PatternOption'>       10
       <class 'PyQt5.QtWidgets.QDialogButtonBox.ButtonRole'>       10
            <class 'PyQt5.QtWidgets.QMessageBox.ButtonRole'>       10
              <class 'PyQt5.QtWidgets.QWizard.WizardButton'>       10
      <class 'PyQt5.QtCore.QSharedMemory.SharedMemoryError'>        9
            <class 'PyQt5.QtGui.QOpenGLBuffer.UsagePattern'>        9
               <class 'PyQt5.QtGui.QPixelFormat.ColorModel'>        9
                        <class 'builtin_function_or_method'>        8
                             <class 'pyqtgraph.Point.Point'>        8
               <class 'PyQt5.QtCore.QIODevice.OpenModeFlag'>        8
       <class 'PyQt5.QtCore.QUrl.ComponentFormattingOption'>        8
        <class 'PyQt5.QtGui.QImageIOHandler.Transformation'>        8
            <class 'PyQt5.QtGui.QOpenGLDebugMessage.Source'>        8
     <class 'PyQt5.QtGui.QOpenGLTexture.ComparisonFunction'>        8
                 <class 'PyQt5.QtGui.QTextListFormat.Style'>        8
 <class 'PyQt5.QtWidgets.QStyleOptionMenuItem.MenuItemType'>        8
                                              <class 'list'>        7
                      <class 'PyQt5.QtCore.QJsonValue.Type'>        7
                 <class 'PyQt5.QtCore.QLocale.NumberOption'>        7
       <class 'PyQt5.QtCore.QOperatingSystemVersion.OSType'>        7
<class 'PyQt5.QtCore.QSystemSemaphore.SystemSemaphoreError'>        7
                        <class 'PyQt5.QtCore.QUuid.Version'>        7
             <class 'PyQt5.QtGui.QTabletEvent.TabletDevice'>        7
           <class 'PyQt5.QtGui.QTouchDevice.CapabilityFlag'>        7
        <class 'PyQt5.QtWidgets.QFileIconProvider.IconType'>        7

Here is the top for the 200th loop:

                              <class 'sip.methoddescriptor'>     8722
                     <class 'PyQt5.QtCore.QLocale.Language'>      375
                      <class 'PyQt5.QtCore.QLocale.Country'>      269
                                              <class 'dict'>      235
                            <class 'sip.variabledescriptor'>      207
                       <class 'PyQt5.QtCore.QLocale.Script'>      144
                  <class 'PyQt5.QtGui.QPageSize.PageSizeId'>      127
          <class 'PyQt5.QtGui.QOpenGLTexture.TextureFormat'>      126
            <class 'PyQt5.QtGui.QPagedPaintDevice.PageSize'>      126
                                             <class 'tuple'>      109
                       <class 'PyQt5.QtCore.QMetaType.Type'>       78
              <class 'PyQt5.QtGui.QKeySequence.StandardKey'>       71
                                              <class 'list'>       69
                                           <class 'weakref'>       63
                             <class 'pyqtgraph.Point.Point'>       60
                        <class 'PyQt5.QtCore.QVariant.Type'>       59
                                             <class 'frame'>       48
                    <class 'PyQt5.QtCore.QEasingCurve.Type'>       48
                                     <class 'managedbuffer'>       40
                                        <class 'memoryview'>       40
                        <class 'builtin_function_or_method'>       39
<class 'numpy.core._internal._unsafe_first_element_pointer'>       38
               <class 'numpy.core._internal.c_char_Array_0'>       38
            <class 'numpy.core._internal.LP_c_char_Array_0'>       38
                                         <class 'traceback'>       36
                  <class 'PyQt5.QtCore.QSysInfo.WinVersion'>       29
        <class 'PyQt5.QtWidgets.QMessageBox.StandardButton'>       27
                  <class 'PyQt5.QtCore.QDataStream.Version'>       26
              <class 'PyQt5.QtGui.QOpenGLTexture.PixelType'>       26
                                 <class 'collections.deque'>       24
                   <class 'PyQt5.QtOpenGL.QGL.FormatOption'>       22
  <class 'PyQt5.QtWidgets.QScrollerProperties.ScrollMetric'>       21
        <class 'PyQt5.QtOpenGL.QGLFormat.OpenGLVersionFlag'>       21
                          <class 'PyQt5.QtCore.QDir.Filter'>       20
      <class 'PyQt5.QtCore.QStandardPaths.StandardLocation'>       20
       <class 'PyQt5.QtGui.QPaintEngine.PaintEngineFeature'>       20
                     <class 'PyQt5.QtGui.QPaintEngine.Type'>       20
<class 'PyQt5.QtWidgets.QTreeWidgetItemIterator.IteratorFlag'>       20
           <class 'PyQt5.QtGui.QImageIOHandler.ImageOption'>       19
            <class 'PyQt5.QtGui.QOpenGLTexture.PixelFormat'>       19
   <class 'PyQt5.QtWidgets.QDialogButtonBox.StandardButton'>       19
                               <class 'threading.Condition'>       18
              <class 'PyQt5.QtWidgets.QWizard.WizardOption'>       17
                <class 'PyQt5.QtGui.QOpenGLTexture.Feature'>       16
                <class 'PyQt5.QtGui.QPixelFormat.YUVLayout'>       16
         <class 'PyQt5.QtCore.QCryptographicHash.Algorithm'>       15
                <class 'PyQt5.QtCore.QFileDevice.FileError'>       15
           <class 'PyQt5.QtCore.QJsonParseError.ParseError'>       15
         <class 'PyQt5.QtCore.QLibraryInfo.LibraryLocation'>       15
                <class 'PyQt5.QtGui.QPaintEngine.DirtyFlag'>       14
             <class 'PyQt5.QtCore.QUrl.UrlFormattingOption'>       13
                        <class 'PyQt5.QtCore.QDir.SortFlag'>       12
               <class 'PyQt5.QtCore.QFileDevice.Permission'>       12
           <class 'PyQt5.QtCore.QXmlStreamReader.TokenType'>       11
              <class 'PyQt5.QtGui.QOpenGLDebugMessage.Type'>       11
          <class 'PyQt5.QtGui.QOpenGLTexture.BindingTarget'>       11
                 <class 'PyQt5.QtGui.QOpenGLTexture.Target'>       11
          <class 'PyQt5.QtGui.QTextFrameFormat.BorderStyle'>       11
             <class 'PyQt5.QtWidgets.QDateTimeEdit.Section'>       11
                                          <class 'function'>       10
     <class 'PyQt5.QtCore.QRegularExpression.PatternOption'>       10
       <class 'PyQt5.QtWidgets.QDialogButtonBox.ButtonRole'>       10
            <class 'PyQt5.QtWidgets.QMessageBox.ButtonRole'>       10
              <class 'PyQt5.QtWidgets.QWizard.WizardButton'>       10
      <class 'PyQt5.QtCore.QSharedMemory.SharedMemoryError'>        9
            <class 'PyQt5.QtGui.QOpenGLBuffer.UsagePattern'>        9
               <class 'PyQt5.QtGui.QPixelFormat.ColorModel'>        9
               <class 'PyQt5.QtCore.QIODevice.OpenModeFlag'>        8
       <class 'PyQt5.QtCore.QUrl.ComponentFormattingOption'>        8
        <class 'PyQt5.QtGui.QImageIOHandler.Transformation'>        8
            <class 'PyQt5.QtGui.QOpenGLDebugMessage.Source'>        8
     <class 'PyQt5.QtGui.QOpenGLTexture.ComparisonFunction'>        8
                 <class 'PyQt5.QtGui.QTextListFormat.Style'>        8
 <class 'PyQt5.QtWidgets.QStyleOptionMenuItem.MenuItemType'>        8
                      <class 'PyQt5.QtCore.QJsonValue.Type'>        7
                 <class 'PyQt5.QtCore.QLocale.NumberOption'>        7
       <class 'PyQt5.QtCore.QOperatingSystemVersion.OSType'>        7
<class 'PyQt5.QtCore.QSystemSemaphore.SystemSemaphoreError'>        7
                        <class 'PyQt5.QtCore.QUuid.Version'>        7
             <class 'PyQt5.QtGui.QTabletEvent.TabletDevice'>        7
           <class 'PyQt5.QtGui.QTouchDevice.CapabilityFlag'>        7
        <class 'PyQt5.QtWidgets.QFileIconProvider.IconType'>        7

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


Maybe found where the backup is happening. The pyqtgraph.ItemView ultimately uses the QGraphicsItem.update() https://doc.qt.io/qt-5/qgraphicsitem.html#update method which the documentation states "This function does not cause an immediate paint; instead it schedules a paint request that is processed by QGraphicsView after control reaches the event loop. The item will only be redrawn if it is visible in any associated view."

This does seem related. The current hang behavior is such that when continuous is started images update fine, but all GUI interaction is hung. Clicking in the GUI than causes the image update to hang as well until the GUI interaction processes eventually. Once that processes images will update again with the GUI interaction still hung up. Is BLACS just getting crushed by Qt paint update requests?

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Increasing the image size does it for me, I am able to reproduce the problem now! On linux and with the latest pyqtgraph even, meaning it's not an issue with your specific environment.

I'll get to the bottom of this!

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Regarding the QGraphicsItem.update(), brilliant, I was suspecting something like that, that is definitely the most promising lead. Good find.

There is a bit more to it though - queued events are normally processed in order and so shouldn't be able to build up since we are not allowing our events to build up. However, events also have priorities. So I suspect that the paint events are piling up behind our events, which are being processed with higher priority and so never allowing the backlog of paint events to clear.

Solution could be to make our events lower priority.

philipstarkey commented 4 years ago

Original comment by Philip Starkey (Bitbucket: philipstarkey, GitHub: philipstarkey).


I was also simultaneously investigating along these lines. The relevant Qt code for repainting widgets is here from what I can tell. And it’s linked to monitor refresh rate too, which may explain why it seemed to be fps related. I suggest updating qtutils inmain(_later) to allow an optional priority and to set it low (lower than Qt.LowEventPriority) for those specific update calls.

Alternatively we could have all of our inmain(_later) calls be low priority, but I’m uncertain if that would slow down our code more than we want…

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


I've just locally added a priority kwarg to inmain_later, and am not seeing it fixing the problem if I set a low priority. Will keep investigating...

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


I may have spoken too soon about being able to reproduce the problem. I can cause a temporary but large increase in memory by increasing the image size to be very very large (10k x 10k), but it looks like this is just a delay in reclaiming memory until the garbage collector runs, I don't think it has anything to do with Qt. And I can still interact with the GUI, it's only laggy because the painting actually takes a long time for such large images, not because there is any kind of backlog, seemingly. Though, garbage collection being necessary still does seem strange, I would have thought reference counting would get these image array objects, there should not be circular references that would require garbage collection. If I run the garbage collector once per loop the memory growth goes away, but this could merely be because it modifies the timing.

David, could you post the exact Python and PyQt versions you see the problem with? I am also wondering if it has something to do with whether Qt is using OpenGL or not, based on the Qt source Phil linked to.

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


Darn. At this rate we might need to invite you over for a talk so you can actually see it in person.

Here is my conda environment. We are still working on getting Windows 10 upgrades so this is all Windows 7. I don’t believe any package versions have changed more than their micro number, with the exception of zprocess.

# packages in environment at C:\Miniconda3\envs\labscript:
#
# Name                    Version                   Build  Channel
autocython                1.1.4                    pypi_0    pypi
blas                      1.0                         mkl  
ca-certificates           2019.5.15                     1  
certifi                   2019.6.16                py36_1  
cycler                    0.10.0           py36h009560c_0  
cython                    0.29.13          py36ha925a31_0  
freetype                  2.9.1                ha9979f8_1  
h5py                      2.9.0            py36h5e291fa_0  
hdf5                      1.10.4               h7ebc959_0  
icc_rt                    2019.0.0             h0cc432a_1  
icu                       58.2                 ha66f8fd_1  
importlib_metadata        0.19                     py36_0  
intel-openmp              2019.4                      245  
jpeg                      9b                   hb83a4c4_2  
kiwisolver                1.1.0            py36ha925a31_0  
libpng                    1.6.37               h2a8f88b_0  
libsodium                 1.0.16               h9d3ae62_0  
libtiff                   4.0.10               hb898794_2  
matplotlib                3.1.1            py36hc8f65d3_0  
mkl                       2019.4                      245  
mkl-service               2.0.2            py36he774522_0  
mkl_fft                   1.0.14           py36h14836fe_0  
mkl_random                1.0.2            py36h343c172_0  
numpy                     1.16.4           py36h19fb1c0_0  
numpy-base                1.16.4           py36hc3f5095_0  
olefile                   0.46                     py36_0  
openssl                   1.1.1c               he774522_1  
pandas                    0.25.1           py36ha925a31_0  
pillow                    6.1.0            py36hdc69c19_0  
pip                       19.2.2                   py36_0  
psutil                    5.6.3            py36he774522_0  
pycapture2                2.13.61                  pypi_0    pypi
pydaqmx                   1.4.2                    pypi_0    pypi
pyparsing                 2.4.2                      py_0  
pypylon                   1.4.0                    pypi_0    pypi
pyqt                      5.9.2            py36h6538335_2  
pyqtgraph                 0.10.0           py36h28b3542_3  
pyreadline                2.1                      py36_1  
pyserial                  3.4                      py36_0  
python                    3.6.9                h5500b2f_0  
python-dateutil           2.8.0                    py36_0  
pytz                      2019.2                     py_0  
pyvisa                    1.10.0                   pypi_0    pypi
pywin32                   223              py36hfa6e2cd_1  
pyzmq                     18.1.0           py36ha925a31_0  
qt                        5.9.7            vc14h73c81de_0  
qtutils                   2.3.1                    pypi_0    pypi
scipy                     1.3.1            py36h29ff71c_0  
setuptools                41.0.1                   py36_0  
sip                       4.19.8           py36h6538335_0  
six                       1.12.0                   py36_0  
spinapi                   3.2.0                    pypi_0    pypi
sqlite                    3.29.0               he774522_0  
tk                        8.6.8                hfa6e2cd_0  
tornado                   6.0.3            py36he774522_0  
tzlocal                   2.0.0                    py36_0  
vc                        14.1                 h0510ff6_4  
vs2015_runtime            14.16.27012          hf0eaf9b_0  
wheel                     0.33.4                   py36_0  
wincertstore              0.2              py36h7fe50ca_0  
windows-curses            2.0                      pypi_0    pypi
xmlrunner                 1.7.7                    pypi_0    pypi
xz                        5.2.4                h2fa13f4_4  
zeromq                    4.3.1                h33f27b4_3  
zipp                      0.5.2                      py_0  
zlib                      1.2.11               h62dcd97_3  
zprocess                  2.16.0                   pypi_0    pypi
zstd                      1.3.7                h508b16e_0  

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Ok, I'm also on a Windows 7 computer with the same versions of everything and not seeing it :/

Then again, this computer is not very fast. I'm struggling to get high frame rates even with fully pre-generated data. Perhaps this bug requires a faster computer to be triggered, maybe something about the screen refresh rate as Phil has mentioned.

So I guess we're still debugging on your setup. Here is attempt at a solution number 1, based on the hypothesis that it's a backlog of paint events that we can prevent by deprioritising our events:

If you replace _inmain_later() in your locally-installed qtutils/invoke_in_main.py with the following:

def _in_main_later(fn, exceptions_in_main, *args, **kwargs):
    """Asks the mainloop to call a function when it has time. Immediately returns the
    queue that was sent to the mainloop.  A call to queue.get() will return a list of
    [result,exception] where exception=[type,value,traceback] of the exception. The
    keyword argument _priority can be used to pass the event priority to the postEvent()
    call. Functions of the same priority are guaranteed to be called in the order they
    were requested."""
    queue = Queue()
    post_kwargs = {}
    if '_priority' in kwargs:
        post_kwargs['priority'] = kwargs.pop('_priority')
    QCoreApplication.postEvent(
        caller, CallEvent(queue, exceptions_in_main, fn, *args, **kwargs), **post_kwargs
    )
    return queue

And change the inmain_later call in blacs_tabs.py to:

 self.update_event = inmain_later(self.update, image, self.frame_rate, _priority=QtCore.Qt.LowEventPriority - 1)

does that solve the problem?

Attempt at a solution number two. If you append:

QtGui.QApplication.instance().sendPostedEvents()

to the end of ImageReceiver.update(), does that solve the problem? This is also based on the hypothesis of a backlog of other events, and we're hoping sendPostedEvents will process them all immediately. This is pretty hacky but could be informative.

And, attempt at a solution number three. I'm not super convinced that changing the call to inmain() is a satisfactory solution since all that does is inmain_later() followed by a .get() on the resulting queue, so it is basically the same except that the .get() is immediate instead of just before the next call.

We can't just get rid of waiting on the queue manually since we need to be checking self.stopping regularly to avoid the possibility of a deadlock upon tab restart, but we could manually .get() from the queue immediately instead of waiting until just before the next event. Leaving it until the last second was just a way of speeding up the max frame rate slightly by allowing the worker to send the next frame prior to the previous one being rendered.

But it sounds like that will make the problem go away, maybe just for timing reasons, even though I'm not convinced it addresses whatever the root cause is.

So if nothing else presents itself about solving this issue and if I continue to be unable to reproduce it, maybe the following patch is the pragmatic way forward. It's dissatisfying though, and will probably limit the achievable frame rates of other cameras:

diff -r 4bc2332cf63848b0a8abd81b8bfbab72e5681f5a IMAQdxCamera/blacs_tabs.py
--- a/IMAQdxCamera/blacs_tabs.py        Mon Aug 26 16:13:28 2019 -0400
+++ b/IMAQdxCamera/blacs_tabs.py        Fri Sep 20 16:30:57 2019 -0400
@@ -56,7 +56,6 @@
         self.label_fps = label_fps
         self.last_frame_time = None
         self.frame_rate = None
-        self.update_event = None

     def handler(self, data):
         md = json.loads(data[0])
@@ -71,19 +70,18 @@
             else:
                 self.frame_rate = 1 / dt
         self.last_frame_time = this_frame_time
-        # Wait for the previous update to compete so we don't accumulate a backlog:
-        if self.update_event is not None:
-            while True:
-                # Don't block, and check for self.stopping regularly in case we are
-                # shutting down. Otherwise if shutdown is called from the main thread we
-                # would deadlock.
-                try:
-                    self.update_event.get(timeout=0.1)
-                    break
-                except Empty:
-                    if self.stopping:
-                        return
-        self.update_event = inmain_later(self.update, image, self.frame_rate)
+        update_event = inmain_later(self.update, image, self.frame_rate)
+        # Wait for the update to compete so we don't accumulate a backlog:
+        while True:
+            # Don't block, and check for self.stopping regularly in case we are
+            # shutting down. Otherwise if shutdown is called from the main thread we
+            # would deadlock.
+            try:
+                update_event.get(timeout=0.1)
+                break
+            except Empty:
+                if self.stopping:
+                    return
         return [b'ok']

     def update(self, image, frame_rate):
philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Also, what do you get for the following?

import pyqtgraph
print(pyqtgraph.CONFIG_OPTIONS)

Mostly interested in whether openGL is in use. It seems like for me it is not, but enabling it doesn't cause the bug to occur, either.

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


Results from testing:

Solution number 2) Started here because it was the easiest to implement. This one works. I actually get a slightly higher framerate (28 fps) than using inmain in place of inmain_later.

Solution number 1) This one helps, but the hangs are still present. They are significantly shorter though. Max framerate was 30 fps.

Solution number 3) This also works. Max framerate is 28 fps.

pyqtgraph.CONFIG_OPTIONS = {'useOpenGL':False,'leftButtonPan':True,'foreground':'d','background':'k','antialias':False,'editorCommand':None,'useWeave':False,'weaveDebug':False,'exitCleanup':True,'enableExperimental':False,'crashWarning':False,'imageAxisOrder':'col-major'}

Hopefully it doesn't come to something hacky just our sake since it appears nobody else seems to be having this issue. I'm willing to try a few more things to get to the bottom of this if possible.

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


This is somewhat unrelated, but should we use the pyqtgraph global option imageAxisOrder='row-major' to avoid the need to transpose the images in ImageReceiver.update? It seems like it would improve performance a bit.

philipstarkey commented 4 years ago

Original comment by Philip Starkey (Bitbucket: philipstarkey, GitHub: philipstarkey).


@chrisjbillington FYI @dihm updated this issue and we didn’t get email notifications for it! Writing this message so you see the updates!

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Thanks Phil, and thanks David.

So whilst I don't see as obvious a hang, it is clear that user interaction is much laggier for me in some circumstances than others, and it seems to follow the same conditions as you say cause your hang. It goes away for me when I apply the changes that you say solve the issue for you, so I am going to assume they have the same root cause.

Event priorities are not solving the issue for me, and as you said they only seemed to help a little bit for you. It seems to me that they are not sufficient - perhaps the priorities of our events and other events are not even being compared, maybe because we have a custom event handler. Maybe the priorities are only meaningful within a single event handler and not app-wide. I don't know, but I couldn't seem to change much by using event priorities.

So I'm proposing PR #83 as a fix. This is essentially the sendPostedEvents() fix, though I've gone ahead and merged the handler() and update() methods as a simplification - they were only separate in order to send the reply to the worker ASAP, but with a tiny addition to zprocess (update required), the handler() method can now send a reply immediately and then continue to do its processing, so no need for a separate update() call to be queued up.

This removes the GUI lagginess for me, and it sounds like it will for you too based on your testing. It still allows the worker to be fetching a new frame whilst the parent process is rendering the previous frame, so we still in principle should be able to render frames at the maximum rate of whichever is slowest: rendering or downloading.

It is usually code smell to be manually calling sendPostedEvents or similar things, but since we are in the main thread when we call it, and since most events are dispatched immediately in Qt anyway (leading to re-entrancy even in single-threaded programs), all it's doing is not allowing these other events to be an exception to that. And we're calling it at the very end of our function, where control is about to be returned to the mainloop anyway, which would be doing the same thing. So I think this is an OK solution.

One possibility is that the events that are processed may themselves post further events, such that we may need to call sendPostedEvents() multiple times to clear the whole backlog. But a) there doesn't seem to be any way to check if there are pending events or to say "run until there are no more", and b) we're still calling it once per frame, so nothing should build up for more than one frame. I think we're good.

David if you could confirm this fixes the problem, maybe we can put this one to rest :)

Regarding imageAxisOrder='row-major', I don't think there is any significant cost to transposing the arrays since this is just a metadata change, it doesn't actually move the data in memory. So unless there is a "real" transpose operation happening at a lower level that imageAxisOrder='row-major' somehow avoids (which seems doubtful), I don't think this will make any difference. Profiling results would change my mind though :).

philipstarkey commented 4 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).


PR #83 does it for me as well, and it does seem like the best solution. I’m more than willing to put this annoying issue to rest.

As for imageAxisOrder, you are likely right, but I’ll check with profiling very quickly just to set my mind at ease.

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Fix issue #36: BLACS freezing in IMAQdxCamera continuous acquisition.

→ \<\<cset 933d05786909278945ab5c45d95dd2a5bda8272b>>

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Merged in camera-hang-fix (pull request #83)

Fix issue #36: BLACS freezing in IMAQdxCamera continuous acquisition.

Approved-by: David Meyer

→ \<\<cset d1a0b996737537412f571473e6d4c6f29683986d>>