labscript-suite / labscript-utils

Shared modules used by the 𝘭𝘢𝘣𝘴𝘤𝘳𝘪𝘱𝘵 𝘴𝘶𝘪𝘵𝘦. Includes a graphical exception handler, debug tools, configuration management, cross platform filepath conversions, unit conversions and custom GUI widgets.
http://labscriptsuite.org
Other
2 stars 47 forks source link

ToolPalette._layout_widgets() appears to recurse, causing a stack overflow #27

Closed philipstarkey closed 3 years ago

philipstarkey commented 5 years ago

Original report (archived issue) by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


A lab here at NIST got a "Python has stopped working" error upon starting BLACS today. Starting Python as python -X faulthander -m blacs revealed it to be a stack overflow in ToolPalette._layout_widgets(), line 348:

self.setMinimumSize(QSize(self.minimumSize().width(), total_height))

Adding a printline to inspect the dimensions of the QSize() object revealed that _layout_widgets() was being called a large number of times prior to the crash, with the dimensions alternating back and forth between two values.

I added a counter to print the recursion depth and confirm that the method is recursing, but I made a syntax error and BLACS started successfully (having caught the error), modifying its save file and widget geometries such that the stack overflow no longer occurred. So unfortunately I lost the ability to reproduce the problem, as it is sensitive to the widget geometries.

Just documenting what I found here. ToolPalette._layout_widgets() seems to be recursing and not converging on a fixed layout geometry that would break the cycle. If I see it again I will backup BLACS save file and connection table file to create a reliable reproducer of the problem.

Others have reported similar crashes before, but they have not been stack overflows, they have been segfaults that looked to be bugs in Qt. This one actually is plausibly our fault which means we may be able to fix it.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


philipstarkey commented 5 years ago

Original comment by Philip Starkey (Bitbucket: philipstarkey, GitHub: philipstarkey).


This is a bit of a shot in the dark, but I suspect that there is a bug when calculating how many items fit in a row. This is likely due to incorrectly accounting for padding/borders/similar. I think such a error could lead to the behaviour you saw (but it's a a while since I touched that code)

Plausibly this issue could cause segfaults on some versions of Qt as Qt often doesn’t like doing things really fast (some of which may be fixed in later versions of Qt)

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


I suspect you're right, that the calculation could be a little bit off. But it would be good if the code wasn't sensitive to relatively minor errors or to changing padding values, so fixing the calculation itself might not be the best solution, unless it's totally bomb-proof and looks up all the padding values at run-time with no hard-coded values.

There are also things that can happen where a widget says it is one size, and on that basis a container widget decides it should show a scrollbar or something, then the child widget is updated with the new size of the container (minus the scrollbar width), it declares itself to be smaller, then the parent widget decides it doesn't need a scrollbar after all, repeat. In this case something needs to break the cycle. We might be able to add a "with signals disconnected, update size" or similar to just stop recursion and keep the most recently calculated size. If that size if "good enough" then we'll be happy, even if there is some hysteresis in the number of buttons per row whilst resizing tabs.

zakv commented 3 years ago

I ran into this bug today and managed to work around it in a way that kept the data in the blacs settings file, which is lost when following the suggestion from the mailing list. I opened the blacs settings file in HDFView, then went to front_panel -> _notebook_data and played with the value for window_height, in my case changing it from 732 to 1000. After saving and closing HDFView, I started blacs and it opened just fine. Furthermore, the rest of the front panel data, such as the tab locations, was retained.

I'm sure 1000 isn't a magic number for window height, so using that value won't always solve the issue. That said it still may be more convenient to try out a few different window height values than to delete the file and reconfigure the blacs gui from scratch.

philipstarkey commented 3 years ago

@zakv Can you reproduce it with your configuration if you set the value back to 732?

Also, Do the widgets on one of the visible tabs (at that window size) just require a scroll bar? I'm wondering if it's due to a relayout swapping between needing a scroll bar and not, and perhaps just making the scrollbar always visible would be a sufficient fix?

zakv commented 3 years ago

Can you reproduce it with your configuration if you set the value back to 732?

Yep. I backed up that non-functional config file and was able to reproduce the bug by reverting to it. I also then edited it to set the window height to 1000, opened blacs (which worked), then closed blacs, then reset the window height back to 732 with HDFView, then blacs failed to open again.

Do the widgets on one of the visible tabs (at that window size) just require a scroll bar?

Yep. I don't know a good way to set the height to 732 pixels given that I can't get blacs to open with that setting, but opening it with the height set to 1000, then adjusting it to be ~3/4 it's original height gives this: image

Also, not sure if this is relevant but the blacs window is partially off the left edge of the screen when I open it with that settings file. For reference, here is how it looks right after I open with with the height set to 1000: image

chrisjbillington commented 3 years ago

For what it's worth I've been sent (from one of the other users having this issue) a config file that may be able to reproduce it - haven't looked at it yet, but will give it a shot at some point. Can send it to you if you want to look into this Phil.

zakv commented 3 years ago

I've run into this issue a few times in the last several months and I recently started learning a bit about Qt, so I decided to take another look at it. My hope was to just count the recursion depth and implement Chris's "with signals disconnected, update size" idea when it got too deep. Unfortunately it looks like my initial naïve implementation of that approach won't work.

After uncommenting/adding some debugging print statements in toolpalette.py I tried opening blacs with the same settings file from my comments above. In my case at least, it turns out that the recursion worked differently than I expected. The resursing steps were the following, which occurred for one of the PCI-6713 tabs judging by the names of the tool palettes:

  1. ToolPalette.resizeEvent() is called for the analog output tool palette.
    • This calls _layout_widgets(), which changes the number of analog output columns from 3 to 2.
  2. ToolPalette.resizeEvent() is called for the port0 digital output tool palette.
    • This calls _layout_widgets(), which actually doesn't change the number of digital output columns.
  3. ToolPalette.resizeEvent() is called for the analog output tool palette
    • This calls _layout_widgets(), which changes the number of analog output columns back from 2 to 3.
  4. ToolPalette.resizeEvent() is called for the port0 digital output tool palette.
    • This calls _layout_widgets(), which actually doesn't change the number of digital output columns.
  5. Back to step 1.

Before this I had thought _layout_widgets() was calling itself more directly so I could count the recursion depth by incrementing a counter at the beginning of the method and decrementing it at the end. However it actually recurses by causing events to be added to a queue, and those events don't start running until the previous ones have finished. That means that the recursion counter always just switched between 1 and 0 without incrementing beyond that. The process went something like the following:

  1. Start handling a QResizeEvent.
  2. Increment the recursion counter to 1.
  3. Perform some operations that add QResizeEvent(s) to Qt's queue.
  4. Finish the call to resizeEvent()/_layout_widgets() which decrements the recursion counter back to 0.
  5. Start handling the next QResizeEvent, going back to step 1.

Given that, the only way I can think of to implement Chris's "with signals disconnected, update size" idea is to add custom events to the Qt event queue. Basically resizeEvent() or _layout_widgets() would start by putting a custom IgnoreResizeEvents event in the queue and end by putting a AcceptResizeEvents event in the queue. The class would be updated to effectively ignore all of the QResizeEvents between those custom events.

I should get a chance to try to implement that soon, though I haven't worked with the Qt event system before so it might take a decent amount of trial/error before I get something working.