labscript-suite-temp-2 / labscript

The labscript Python library provides a translation from simple Python code to complex hardware instructions. The library is used to construct a "connection table" containing information about what hardware is being used and how it is interconnected. Devices described in this connection table can then have their outputs set by using a range of functions, including arbitrary ramps.
BSD 2-Clause "Simplified" License

Saving of hg info slows down compilation #38

Open philipstarkey opened 7 years ago

philipstarkey commented 7 years ago

Original report (archived issue) by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


Pull request #7 introduced the saving of hg (mercurial) repository version information into the hdf5 file. However, tests on my PC indicate this adds an additional ~3 seconds to compilation time (original compilation time was ~0.65 seconds; now ~3.5 seconds).

We should investigate this and correct it.

philipstarkey commented 7 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Could you attach or send me a labscript file and globals file(s) I could profile this with?

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


I've emailed them to you as I don't think we want to be sharing our current globals and labscript logic publicly :)

philipstarkey commented 7 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Thanks! The fundamental problem is that each call to mercurial takes a non-negligible amount of time, about 0.1 seconds on my computer. That times three commands per file, times however many files you have imported from your labscriptlib, adds up quickly.

Testing your script, it compiles for me in 0.6 seconds with the mercurial stuff disabled, and 13.7 seconds with it enabled, but with the search path replaced with labscript_suite rather than labscriptlib, because I don't have anything in my labscriptlib (your script had a few dependencies on monash krb modules in labscriptlib that I don't have, so I commented them out). This resulted in it looking at 42 files.
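As a back-of-envelope check, the numbers above roughly add up, assuming ~0.1 s per hg invocation and three commands per file (the function here is purely illustrative, not labscript code):

```python
def hg_overhead_seconds(n_files, per_call=0.1, calls_per_file=3):
    """Estimated extra compile time from shelling out to hg once per command
    per file (the numbers are the rough figures quoted above, not measurements)."""
    return n_files * calls_per_file * per_call

# 42 files at 3 calls each and ~0.1 s per call gives ~12.6 s of overhead,
# consistent with compilation going from 0.6 s to 13.7 s.
print(hg_overhead_seconds(42))
```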

This is mostly, as far as I can tell, Python interpreter startup time and imports and stuff that mercurial is doing, since each of these calls is a separate Python subprocess (mercurial being written in Python). However, you can make mercurial calls directly with mercurial's API, like so:

#!python

import os
import mercurial.dispatch

# os.chdir() into the folder the file is in first, and back again afterwards
request = mercurial.dispatch.request(
    ['log', '--limit', '1', 'some_file.py'],
    fout=open(os.devnull, 'w'), ferr=open(os.devnull, 'w'))
mercurial.dispatch.dispatch(request)

instead of:

#!python

process = subprocess.Popen(['hg', 'log', '--limit', '1', 'some_file.py'],
                           cwd=folder, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, startupinfo=startupinfo)
info, err = process.communicate()

Except you would replace fout and ferr with StringIO objects or something to capture the output. (I put /dev/null there just so I wouldn't be measuring how slow runmanager's outputbox might be coughing out all that text. Turns out the outputbox is pretty fast though.)

When I do this, the compilation takes 1.9 seconds. Much faster, but still a factor of three slower than compiling without saving hg data. For you though, looking at how it approximately scales, it would probably only add 0.1 to 0.2 seconds to your shot compilation times, because you must not be using that many files in labscriptlib (definitely not 42).

So. Do we want to do this? One issue is that mercurial as installed on the system might not be mercurial as importable from Python, since if you install tortoisehg it probably ships with its own bundled Python I would guess. Nevertheless mercurial is conda installable, so labscript doesn't have to use system mercurial, it could use conda mercurial. It could try to "import mercurial", which would give conda mercurial if it exists, and if that fails it could just give up even though the system might have mercurial installed elsewhere.

So that's where it's at I think. Mercurial is too slow, we can speed it up but only if the mercurial library is importable from the running Python interpreter, which system mercurial on Windows is almost certainly not (on linux it is if you're using system python, and isn't if you're using anaconda).

Another option would be to do "hg status" on each repo in labscriptlib instead of on each file. Repos can be identified because they contain a .hg folder. So you could look at what modules are imported from labscriptlib, then do a bit of a filesystem walk covering all their ancestor directories up to labscriptlib looking for .hg folders, then call hg status etc on those repos exactly once each, and save that data to the h5 file on a per-repo basis instead of per-file.
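The repo-discovery part of that idea could be sketched like this; the helper name is hypothetical (not existing labscript code), and it assumes repos are identified by a `.hg` directory as described above:

```python
import os

def find_hg_repos(module_files, top):
    """For each imported module file, walk from its directory up to `top`
    looking for a .hg folder. Returns the set of unique repo roots, so that
    hg log / hg status need only be invoked once per repo, not per file."""
    top = os.path.abspath(top)
    repos = set()
    for path in module_files:
        folder = os.path.dirname(os.path.abspath(path))
        while True:
            if os.path.isdir(os.path.join(folder, '.hg')):
                repos.add(folder)
                break
            if folder == top:
                break  # reached labscriptlib without finding a repo
            parent = os.path.dirname(folder)
            if parent == folder:
                break  # hit the filesystem root; give up on this file
            folder = parent
    return repos
```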

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


Thanks for looking into this!

I'm trying to think of other solutions to this. In some ways, I'd like to see this code moved out into runmanager so that we can turn it into a plugin and support other version control systems (like git) that people might be using for their labscriptlib/analysislib code. However, I don't see any nice way to know what modules are in use prior to running the script, so there is no way to save time.

Another hacky solution could be to spawn a process whose job it is to asynchronously get the hg status information and write it to the h5 file. This could be done while BLACS is using the shot file, since h5lock will protect everything. But it's a bit hacky, since the user wouldn't know when the information had actually finished being written...

I suppose the batch_compiler process could cache all of this information for labscriptlib and pass it to labscript, so that labscript can pull out the information it needs for the modules in use? The batch_compiler could watch the folder and update the information when files change (this assumes that labscript can't be run without the runmanager batch_compiler, which I think is correct?). Or maybe runmanager should be doing the monitoring and passing the information to the batch_compiler, so that it can display the hg repo status to you in the GUI to remind you to commit?

Not sure, trying to think outside the box!

Maybe we should just try the python hg library approach (with a fallback to command line hg) for now and see how that goes? Maybe add a flag in runmanager to turn it off completely for those who need fast compilation speed?
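A minimal sketch of that try-the-library-first-then-fall-back idea; the function name is illustrative, not an existing labscript API:

```python
import subprocess

def get_hg_backend():
    """Prefer the importable mercurial library (fast, in-process); fall back
    to the `hg` executable (slower, one subprocess per call) if unavailable.
    Returns a label naming the usable backend, or None if there is neither."""
    try:
        import mercurial.dispatch  # noqa: F401 -- only checking importability
        return 'library'
    except ImportError:
        pass
    try:
        subprocess.check_output(['hg', '--version'])
        return 'command line'
    except (OSError, subprocess.CalledProcessError):
        return None
```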

philipstarkey commented 7 years ago

Original comment by Ian B. Spielman (Bitbucket: Ian Spielman, GitHub: ispielma).


I really think that software-driven slowdowns are not OK. So maybe this is a good time to think about organizing what is written into the h5 file into two groups: things that must be done to compile the shot so BLACS can use it, and things that must be done before the file is archived. Anything in the latter camp can still be in progress after the shot is submitted.

philipstarkey commented 7 years ago

Original comment by Ian B. Spielman (Bitbucket: Ian Spielman, GitHub: ispielma).


Second, and unrelated to my first comment: why are we using revision control to solve this problem? Why not save the full text files? In a compressed h5 file this will be quite compact, and will allow for offline consideration on machines that do not have the files that are referenced by the revision control system. Ian's point: bits are cheap, and dependencies come with a price.

Maybe this is in support of the "give choices" suggestion, where storing it all, hg, and git are all options.

philipstarkey commented 7 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


The file contents themselves are saved, but often a diff is more informative - you're interested in asking "what changed about this code compared to other code we were running around that time?" I know for example that diffing globals of shot files against the globals in runmanager has been invaluable at Monash -- I believe that there is a poster on the wall of Lincoln's lab warning of the terrible fate that awaits the poor soul who forgets to diff their globals.

Given that the files themselves are saved though, the diffs and version control log are just icing on the cake, and shouldn't replace documenting what you're doing in a lab logbook. So I don't mind ditching it for the sake of speed, but it is still nice, so I'm interested in how it can be done without the time cost of calling mercurial at compile-time (and I'm not sold on using the mercurial API at compile-time; it's less robust than using system mercurial executables, and clearly doesn't scale to large numbers of files in any case).

I think runmanager or batch_compiler maintaining a cache might be workable. It's a shame INotify isn't a thing on Windows; if it were, we could be sure to be notified of changes in a race-free way and not have to resort to periodic polling of the filesystem. But FileWatcher and ModuleWatcher have proved to be performant, and they watch large numbers of files, so I think watching labscriptlib with FileWatcher is probably fine (the cache should be saved to disk too, so that you don't pay a 14 second startup time in runmanager for 42 files!).

With a cache, the number of hg calls can be reduced considerably, so that you won't pay 14 seconds even if you do an "hg clone" of a new repo in labscriptlib or something. You only need to call hg log once per repo, not once per file, and you can call hg status once per repo as well to get a list of files and their statuses, rather than once per file. Then, calling hg diff per file is still useful, but the results of hg status tell you in advance which files will have no diff, so you can skip the call for those.

All the data can be obtained with a small number of mercurial calls, ensuring that if the user clicks "engage" 2 seconds after a mercurial clone, they will still get up-to-date data (runmanager should make a final call telling FileWatcher to check timestamps at that time, to be sure).
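The status-guided part of that scheme can be sketched as follows. This is a hypothetical helper, not existing labscript code, and it assumes the default `hg status` output format of one status code and path per line:

```python
def files_needing_diff(status_output):
    """Parse `hg status` output (one '<code> <path>' per line) and return the
    paths for which `hg diff` could produce output: modified ('M') and added
    ('A') files. Clean and untracked files are skipped entirely, avoiding one
    hg subprocess call per unchanged file."""
    needs_diff = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        code, path = line.split(None, 1)
        if code in ('M', 'A'):
            needs_diff.append(path)
    return needs_diff
```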

I have some experience with doing things similar to this, so I could use that code as a starting point, I think it wouldn't be very hard to get what we want.

So I think I like this. I propose we disable the hg info (still saving the file contents for now) and leave this bug report open pending a solution like that, which I'm happy to keep assigned to me, though I can't promise when I'll get around to it (perhaps after the current round of more science-related changes I'm aiming to get done).

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


Can we disable it in a way that allows someone to re-enable it from their labscript file? For instance, disabled by default, but then you could do something like:

#!python

import labscript
labscript.compiler.save_hg_info = True

or similar at the top of your experiment script?

I'm only suggesting this because I don't know how much the Monash spinor lab want this functionality enabled. @rpanderson was the one who made the pull request (pull request #7) in the first place to update the code from supporting SVN to mercurial, so I'm guessing there was a reason for that, in which case a method to enable this for critical cases (with the knowledge that it slows down compile time) seems like the optimal temporary solution.

I'm also happy to unassign this and leave it to be picked up by whoever wants it when they have time.

philipstarkey commented 7 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Ah, that sounds like a good way of turning it off, yep. And we shouldn't remove the existing code until there's something to replace it.

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


FYI: We could maybe look at using QFileSystemWatcher as a better platform-agnostic way of monitoring files. It uses INotify on Linux, and I believe Windows has something similar that Qt is presumably using. I think I proposed using the Windows API many years ago, but there was only a dodgy third-party library available to interface with it, which we decided to avoid. But this was way before we were using Qt, so maybe we could revisit it sometime.

philipstarkey commented 7 years ago

Original comment by Ian B. Spielman (Bitbucket: Ian Spielman, GitHub: ispielma).


I really like the idea of having a script-level way of turning this on and off. I'm not in love with the idea of introducing dependencies on Qt at the core level, however. One of the dreams some of our postdocs (one postdoc in particular...) have is of fully separating "core functionality" from "UI", where everything could be done at the command line if so desired, with exactly zero GUI. I agree with the utility of such a logical separation, even if I would want a GUI personally.

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


Oh yes, definitely. I was only suggesting using QFileSystemWatcher for the GUI applications that actively monitor filesystem changes (like the BLACS GUI does to prompt about connection table recompilation and the proposal to have runmanager do it for repository status if we decide to go in that direction).

That said, a lot of Qt (everything in QtCore) is designed to run in command line applications as well as GUI applications. Qt have a QCoreApplication class specifically for starting up the Qt event loop in command line software that doesn't have a GUI. But I think it's a pain to compile Qt on headless systems if they are truly headless (aka don't have a graphics card at all), so I'm still keen to avoid adding Qt as a dependency to anything that isn't going to always be tightly coupled to a GUI.

philipstarkey commented 7 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


I also would not have a problem continuing to use FileWatcher, but having FileWatcher (perhaps optionally) wrap QFileSystemWatcher. A single, contained dependency like that is flexible: you can strip it out later easily, and all programs that depend on that code will change their behaviour accordingly. I'm not saying that's what we should do, just that I wouldn't have a problem with it, since it would be a single point of change to revert.

I'm skeptical, though, that QFileSystemWatcher is sufficiently general for us. For example, most of these mechanisms don't work on network drives (FileWatcher does). I was also unaware that Windows had an INotify equivalent, and the docs for QFileSystemWatcher don't specify what mechanism it uses on Windows.

Bottom line, I think, is that sticking with FileWatcher and its polling is a good idea unless we see a performance reason not to. But I'm interested to know how QFileSystemWatcher works, so that we can make an informed decision. Depending on Qt without a GUI is also fine by me, but it is admittedly one of the more complicated dependencies to make sure you have (conda manages to hide this pretty well though!), so that does count against it.

philipstarkey commented 7 years ago

Original comment by Philip Starkey (Bitbucket: pstarkey, GitHub: philipstarkey).


Continuing the off-topic conversation: it looks like QFileSystemWatcher uses this on Windows (and also a lot of QThreads).