Closed by philipstarkey 7 years ago.
Original comment by Shaun Johnstone (Bitbucket: shjohnst, GitHub: shjohnst).
There used to be something like this, where a multishot analysis routine would be passed a path to an h5 file for storing results in. I'm not sure if the results ended up in the dataframe, but you could use them by opening the h5 file from other routines. This saving of results was done with something along the lines of:
#!python
seq = Sequence(path, df)
seq.save_result('result name', result)
In fact, this was all listed in the (out of date) documentation (based on the older GTK version). Somewhere along the way something changed so that lyse.path = None for multishot routines, perhaps in the port to Qt? @cbillington, was there a reason for this change, or is it just that no one here at Monash has been using the Sequence class in their scripts, so the functionality was forgotten about and broken as other things changed?
Perhaps rather than saving to a routine-specific h5 file, we should be saving the result to each individual h5 file in the sequence? When you call Sequence.save_result() it could check whether path is None, in which case it would save to every run_path instead.
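A rough sketch of that fallback, assuming (hypothetically) that Sequence keeps its Run objects in a self.runs dict and records its own results file path in self.h5_path; not the actual current API:
#!python
from lyse import Run

class Sequence(Run):
    def save_result(self, name, value):
        if self.h5_path is None:
            # No multishot results file to write to: fan the result
            # out to every individual shot file in the sequence.
            for run in self.runs.values():
                run.save_result(name, value)
        else:
            Run.save_result(self, name, value)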
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Yes, it was lost in the port to Qt because nobody was using it.
That's not what @PhyNerd is getting at though, I think. He's saving the data to individual shot files, it's just that lyse isn't going out to the files again to get this new data and put it in the dataframe.
We could certainly do this - something like communicating to lyse a list of shots that need updating, or having save_result() automatically inform lyse that a shot needs updating. That's a pretty good way of doing it, actually, and would help avoid needlessly re-opening the h5 file when nothing had changed.
The trouble with it is that opening and closing lots of files is a bit slow. Lyse would hang for potentially some tens of seconds if you had a few hundred shots, so a progress bar would probably be in order. Perhaps there could be some way to make it more efficient. If it is useful despite the slowness, then it's probably worth adding the functionality.
A workaround for the moment might be to add a dummy singleshot analysis routine that does nothing, uncheck all other singleshot and multishot analysis routines, and then mark all shots as not done - lyse will run the do-nothing script on them and then slurp in the new data produced from the multishot routine.
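For concreteness, that dummy routine can be an (almost) empty script loaded as a singleshot routine - say, a hypothetical do_nothing.py:
#!python
# do_nothing.py: performs no analysis. Running it merely causes lyse
# to re-read each shot file afterwards, picking up the results that
# the multishot routine saved there.
pass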
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
Well, @shjohnst's idea of saving to the individual files is what I have already implemented. The problem, just as @chrisjbillington said, is that the dataframe isn't updated. I am currently running an empty singleshot routine to trigger the update, and yes, it is quite slow. I guess the change I proposed would be faster and would solve the problem of the user having to mark all shots as not done.
Another idea, to reduce the read and write operations and increase speed: upon saving results to the h5 file, also build up an array of the results, which is then added to the dataframe at the end of the routine. This would eliminate all read operations for the dataframe update. This kind of change could also bring a speed increase to singleshot routines.
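A minimal sketch of that caching idea - names like results_cache and the 'filepath' column lookup are illustrative, assuming a pandas dataframe like the one lyse serves:
#!python
results_cache = {}  # {(h5_path, result_name): value}

def save_result_cached(run, name, value):
    run.save_result(name, value)  # still write to the shot's h5 file
    results_cache[(run.h5_path, name)] = value

def apply_cache_to_dataframe(df):
    # One pass over the cached results, no h5 reads required.
    for (path, name), value in results_cache.items():
        df.loc[df['filepath'] == path, name] = value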
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Sending the data back to the lyse GUI directly, avoiding going via the file (even though the data is also in the file) might be the way to go for acceptable performance. I suspect if I implemented something which merely marked the shots as requiring updating, it would still be almost as slow as what you're doing now. This would not be too hard I imagine.
Probably a modification to Run.save_result(), such that when it saves a result to the file it also sends it to the lyse GUI (if it can establish that the current process is in fact running as a multishot analysis routine - probably via a flag set by lyse.data() when the user fetches the dataframe), or stores it somewhere for later sending to the lyse GUI, might be good.
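Something along these lines, perhaps. In this hedged sketch the flag _connected_to_lyse, the helper _send_result_to_lyse(), and the h5 layout are all assumptions, not existing API:
#!python
import h5py
import lyse

class Run(object):
    def save_result(self, name, value):
        # Write the result to the shot file, as save_result does now
        # (the h5 layout here is schematic):
        with h5py.File(self.h5_path, 'a') as f:
            f.require_group('results').attrs[name] = value
        # Hypothetical flag, set by lyse.data() when the dataframe was
        # fetched from a running lyse instance:
        if getattr(lyse, '_connected_to_lyse', False):
            _send_result_to_lyse(self.h5_path, name, value)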
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Hm, I suppose so, yes. So just modifying Run.save_result() to check "are we running in lyse", and if so, send back a dataframe once you're done processing.
What I was thinking was that multishot routines can also be run from outside the lyse GUI - but since they call lyse.data() the process knows that there is a running instance of lyse, and so could still update it even though the script is not being run from lyse directly. Whereas a singleshot routine running outside lyse can't know for sure that lyse is running or on what computer.
But, we should fix this for routines running from within lyse, at least. At present lyse has no idea about what's running outside of it, and does not respect changes to the files that it didn't explicitly initiate, so this would be an improvement even if it doesn't make lyse 100% aware of all changes made to files.
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
It's quite easy to make analysis routines aware of whether they are running in lyse. To lyse's __init__.py add:
#!python
def in_lyse():
    return False
and to the __init__() of analysis_subprocess.py's AnalysisWorker class add:
#!python
def in_lyse():
    return True
lyse.in_lyse = in_lyse
This is a bit hacky but works great. That leaves the problem of updating the dataframe.
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
Oh, sure, and in fact there's a variable that already does that: it's called lyse.spinning_top... a lame reference to the movie Inception. We should probably give it a better name.
Whilst it's easy to tell whether a routine is running from within lyse, the distinction I was getting at is that multishot routines can get information about a running lyse instance even when they are run from outside it, because the user calls lyse.data() (potentially specifying a hostname other than localhost), which connects to a running lyse instance and returns the dataframe. A singleshot analysis routine running outside lyse, by contrast, makes no contact with lyse and can't know for sure that lyse is running, or on what computer. So it's possible to ensure multishot routines outside of lyse also inform lyse about changed data, whereas it is not so easy with singleshot routines. But I think we should probably just not support keeping the lyse interface updated when running routines outside lyse. If you're running routines outside lyse, you're on your own!
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
Nice naming! Made me chuckle. You should definitely keep the name, but maybe add a comment on what the variable is for.
I now see what you're getting at. I would propose that we neglect anyone running scripts outside of lyse and in lyse simultaneously. That shouldn't be too many people, and everyone else should be just fine with the change. This would also allow for an easy implementation, using a variable like lyse.spinning_top to store a Qt signal that can then be emitted on every save to send the updated information back to the GUI. Or does anyone have a better solution for sending data back?
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
It wouldn't be a Qt signal, as analysis routines are actually running in their own subprocess even when lyse is running them. I do have an interprocess signal system that we use for some things like that, but I think the solution is much simpler. The analysis subprocess already needs to say "done" to lyse - it would now say "done, and by the way, here is a dictionary of values you should update or add to your dataframe". The dictionary would be added to in each call to Run.save_result() if spinning_top was True. The dictionary itself would live in the lyse module, but probably be None or not exist if not running from within lyse. The wrapper code that calls your analysis routine would reset the dictionary to an empty one prior to each run of your code.
Then lyse would need to grow a little bit of code to update the dataframe and GUI with this data.
I see no remaining problems with implementing the feature this way, and I'm happy to do the coding for it, though of course I can't guarantee when! I think this change would speed up the simplest single-shot analyses for everyone quite a bit, since lyse would no longer need to read the HDF5 file after each one.
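A sketch of how those pieces might fit together - all names here (_results_to_send, execute_routine, to_parent) are illustrative, not the actual implementation:
#!python
import lyse

# Hypothetical module-level store: None except while lyse is running
# a routine; the wrapper resets it to {} before each run.
lyse._results_to_send = None

# Inside Run.save_result(), after writing the value to the h5 file:
def _record_for_lyse(h5_path, name, value):
    if lyse.spinning_top and lyse._results_to_send is not None:
        lyse._results_to_send[(h5_path, name)] = value

# In the analysis subprocess wrapper, around each run of a routine:
def run_routine(execute_routine, to_parent):
    lyse._results_to_send = {}
    execute_routine()
    # Report 'done' plus everything saved during the run, so lyse can
    # update the dataframe and GUI without re-reading any h5 files.
    to_parent.put(['done', lyse._results_to_send])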
Original comment by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
I implemented the solution we discussed. I'd be really happy about feedback and proposals for improvement before creating a pull request: UpdateDataframeNoRead
Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).
This is looking great! I have some minor quibbles but it seems to me to be basically the ideal way to implement this.
You should make a pull request, and we can discuss the specifics there (and seek approval from others).
Original report (archived issue) by Jan Werkmann (Bitbucket: PhyNerd, GitHub: PhyNerd).
In our lab we use a mean image, generated from all images taken in a sequence, to determine the positions of single atom traps. This data is then used to determine, for each trap, whether there was an atom or not. We modified our Sequence.runs to allow save_result to save this information. But since the dataframe only updates after singleshot routines, these results are not available to other routines.
It would be nice if there were a function to allow adding a list to the dataframe as a column from a multishot routine. Alternatively, one could introduce a flag that can be set in a multishot routine so that the dataframe gets updated.