Preparse thread memory leak/crash

philipstarkey commented 5 years ago

Original report (archived issue) by David Meyer (Bitbucket: dihm, GitHub: dihm).

The original report had attachments: RunmangerScreencap.png

So I just brought my repositories up to speed and now I can't get runmanager to work. I'm not getting any really helpful information, but here is what I have so far:

Upon opening the GUI, the engage button is always grayed out.
All of my globals (including simple numbers) only show as "Evaluating" in their hover text.
One of the python processes spawned by runmanager quickly balloons in memory (multiple GB after a few seconds).
I'm running python 2.7.15.

Attached is a screenshot of the GUI.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

Thanks for the bug report. This is a little concerning! Do you know what version you were on before the update? Perhaps we can 'bisect' the changes and figure out where the issue was introduced. If you're enthusiastic about this, you would want to search among commits in the default branch only - there are not that many (tortoisehg can show only the default branch if you press ctrl-s and select 'default' from the dropdown menu in the top right).

If you are happy sending me your globals file, I can also see if I can reproduce it.

I also notice you have customised runmanager a little - is the issue present on "stock" runmanager?

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).

I was at the merge for pull request #27 and I'll e-mail the globals file to you directly.

The modification is just adding a QTimer to press the engage button every X seconds if that check box is checked, so hopefully that isn't the issue. I can try "stock", along with some other testing, when I get back to the lab tomorrow.

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

Ah, it sounds doubtful a non-active Qt timer could have anything to do with it.

With the h5 file you sent me, I can reproduce the issue (though with other fairly complex ones I could not). I'll look into it!

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

The issue is independent of Python 2 vs Python 3.

I have to revert back to version 2.1.0, just before the merge of pull request #15 before it works again.

However, if I then update back to the latest version, the issue no longer occurs. This is because version 2.1.0 modifies the names of the zip groups, presumably erroneously since pull request #15 is called "Fixed several bugs where custom zip groups were overridden." So that means there's some hysteresis to the testing...I'm having to restore the h5 file in between each test. It also means 2.1.0 is probably not actually immune to the problem, it's just not hitting on the conditions that trigger it since it modified the zip groups. So this seems at odds with you seeing things working before updating from the commit at pull request #27, so something else must have changed as well.

For anyone else interested, here's a screenshot of the offending group:

Screenshot from 2019-03-28 18-30-57.png

Activating just this group causes the crash - takes about 20 seconds to fill my computer's 32GB of memory before the OS kills it.

It's pretty compex, I'm definitely not able to tell how many shots this results in by looking at it. You've got six axes, lots of references, one global is an array that is the concatenation of three of the others. No wonder we didn't see this elsewhere! My guess is that runmanager is interpreting one of these arrays as being extremely large such that it doesn't fit in memory.

Further debugging reveals that runmanager is choking on expanding the outer product over all axes - it interprets the above as describing 57,600,000 shots, so no wonder.

The axes are as follows:

      zip Res_regime n =  40
                zip  n =  1
       zip AC_regime n =  60
      zip uWave_scan n =  120
         zip resDets n =  10
     zip High_regime n =  20

(the unnamed zip group with n=1 must be the globals not in any group, not sure).

If this isn't the result you were going for, can you tell what dimensions runmanager should have gotten?

Or perhaps runmanager changed the expansion settings, and when set back to what they should be, it doesn't choke on them?

If you want to debug further (or just recover from the situation) you can modify runmanager's default saved config file (/runmanager.ini) and deactivate the groups (or just rename the config file so runmanager starts up fresh). You can then open the globals file in a fresh runmanager, and modify the values before setting the group to be active. The preparser won't touch it if it's inactive.

philipstarkey commented 5 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).

So I don't have time to look into this in more detail today, but this is my thoughts so far:

using np.concatenate() really should be avoided. It's not going to play nice with zip group detection at all...for example notice that uWave_scan is in a zip group with AC_regime, but they are going to be different lengths.
The use of np.concatenate() might be acceptable if it was used as np.concatenate(np.array(arr1), np.array(arr2), ...) where arr1, arr2, etc. are defined as tuples in runmanager (so they don't form an axis of the parameter space)

If the aim is to only have uWave_scan as the axis of the parameter space (and everything else is ignored), then wrapping all of the other globals in tuple() (and converting back to an array when using in a subsequent global) will probably make the problem go away.

I'm not sure there is much else we could do to fix it in runmanager, because trying to make runmanager form a single axis out of 6 globals of different lengths is probably very, very hard to get right (especially since we need to do it in a way where it never attempts to expand them in an outer product first, even if the logic would ultimately find the correct configuration after several iterations).

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).

Should have figured it was something to do with this. The aim is to use uWave_scan as the only axis of the parameter space. The other globals just help in constructing the somewhat complicated scan structure (in this case, a mixture of 5 separate log scales and a linear scale).

I end up needing fairly complex scans on a regular basis and have found that building in chunks worked OK in the past. Normally runmanager just gets confused then I manually deactivate all the unwanted expansions. Must have pushed a little too far adding that final linear range and didn't realize it until I restated runmanager after the update and the expansions were rechecked. Oops. I am back up and running though.

Anyway, a few follow-up items then we can probably let this issue go.

The first thing I tried to get up and running was to just go into the globals file and remove all the expansions. That was only mildly effective though since runmanager seems to ignore that field and re-guess the expansions from scratch when first opened. Is that really desirable behavior? Why save the expansions to the file if they are ignored when it's read anyway?
Wrapping things in tuples works, but I'm not a fan of it as a long term solution. While I find it a bit obfuscating, my bigger concern is that it eats up real estate in the value field, which is already pretty limited. In fact, the main reason I have been breaking up the scan calculations is because putting them into one monolithic line isn't manageable since the entry field isn't big enough to see everything. Is there a way to 'fix' this more elegantly? I agree with Phil that having runmanager somehow do the logic automatically isn't worth the time or effort. I can conjure a few potential ideas that may be simpler:
- Some sort of control that tells runmanager to ignore a group when guessing expansions. Likely another column in the group tab. Then all these intermediate calculations could be moved into an ignored group. I imagine zip group detection might get a bit tricky though.
- What about just making the value field line wrap? Then it would be easy to move all the calculations to a single field.
- Have the expansions on first start be read from the globals file by default. Add a "Re-evaluate Expansions" button somewhere the user can manually use that provides that hard-reset functionality.

Anyway, those are just a few thoughts. As always, thanks for your prompt replies.

philipstarkey commented 5 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).

Hi David,

Just wondering about your second point: Is the issue that the line is too long to fit on your monitor? Or that, no matter how big you make runmanager, the field stays the same size. As far as I'm aware both the value display and editing field of global values can grow as much as you like (and the whole view is in a scroll window). Just wondering if this is what you see too, or if there is a subtle Qt bug that makes your value field display differently to mine.

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).

Chris,

Oh, it is definitely just that the line is too long to fit on the monitor.

I actually went ahead and took a stab at point 2 since it seemed like it might be easy enough for me to quickly try out. That turned out not to be the case but it is half-way there. It's done by over-loading the sizeHintRole of AlternatingColorModel.data() and adding an extra signal/slot for column resizing to trigger it.

It's only half done because the editor box does not have linewrapping making this a purely cosmetic change for the moment.

I'm reasonably sure it's possible, but it goes beyond the time I'm willing to commit to something that probably won't leave our lab anyway.

philipstarkey commented 5 years ago

Original comment by David Meyer (Bitbucket: dihm, GitHub: dihm).

I had some extra time and figured this out the rest of the way. Turns out to be pretty simple once you find the right widget. Boils down to overloading the ItemDelegate.createEditor() method to use QPlainTextEdit instead of the default QLineEdit. It looks like this:

I'm actually fairly happy with this solution and intend to go with it. Is there interest enough for me to submit a PR?

philipstarkey commented 5 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).

Hi David,

Looks great!

I see no downside to making line-wrapping the default, and since your code isn't doing anything magical (overriding the methods in the way you've done sounds like the 'proper' way to do this) I don't think there'd be any issue of choking on edge-cases. So I'm interested, if you make a pull request I'd be happy to review and would merge if there are no obvious problems!

labscript-suite-temp-2 / runmanager

Preparse thread memory leak/crash #66