msyriac opened this issue 4 years ago
Hey @msyriac ! Thanks for pinging us about this (and sorry for the delay). So, initially this was meant to be a BB-only thing, but only because I didn't want to push this on other groups unless they found it useful. If you think it'd be useful for you guys, then I think what we should do is move all the pipeline-building code into a repo with a more generic name (+ improve the documentation), and move all the BB pipelines into another repo.
So indeed, although the current answers are 1. yes and 2. no, I think they should be 1. no and 2. yes, so I'm happy to do this. (b) sounds good too; happy to include it.
How would you like to proceed?
We did consider it at the beginning, but postponed its use to a later stage, when we would be able to better weigh the pros and cons based on concrete use cases. My current feeling is that this pipeline makes heavy objects impractical or expensive to recycle and move around. So, using a master script seemed more natural so far. I am happy to reconsider using BBPipe (or whatever its name will be) if the experts think differently.
@dpole OK, thanks. That's interesting... What do you mean by "heavy objects" in this context? In principle this is kind of the problem that this is meant to solve, but maybe I'm missing something.
It is more likely that I am missing something and I can actually achieve what I want with BBPipe. So far, when analyzing LAT maps, I have often been close to saturating the node's memory (and I wanted to avoid using MPI for simplicity).
Controlling memory usage, and when objects are read or written, seemed easier to me with a master script than with a pipeline builder. Any comment on this is more than welcome. In particular, how easy is it to tell the pipeline that a given stage should pass its outputs directly to the next stage instead of writing them and passing only the location on disk?
On the other hand, BBPipe seemed very convenient when you have non-trivial dependency graphs between stages, but that hasn't been my case yet. In that case, can the scheduler take into account the memory required by each stage?
@msyriac, sorry for cluttering the initial discussion.
This is a very useful discussion @dpole! Every stage in BBPipe needs to write its outputs to file to communicate with the next stages. So if you want to pass data between different steps without writing to disk, you need to make those steps part of the same stage. I don't think a script can do any better than that. The main advantage of things like BBPipe (which, by the way, is not really my code, but something I've recycled from LSST) is that it makes it very easy to add/remove stages, make sure that the pipeline structure makes sense, and in general keep things tidy. This is the main reason why I moved away from bash scripts for anything that is mildly complicated (god knows how many bugs I've introduced by search-replacing in those), although I admit that scripts are easier to write when your pipeline is simple.
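To make the "fold the steps into one stage" idea concrete, here is a rough sketch. The import paths, the `NpzFile` type, the `get_input`/`get_output` calls and the stage itself are assumptions about a BBPipe/ceci-style interface, not verified BBPipe code:

```python
# Hypothetical sketch: source subtraction and coadding folded into a single
# stage, so the intermediate (source-subtracted) maps stay in memory.
# Import path, type class and get_input/get_output are assumed, not verified.
import numpy as np
from bbpipe import PipelineStage   # assumed import path
from bbpipe.types import NpzFile   # assumed type class

class SourceSubAndCoadd(PipelineStage):
    name = "SourceSubAndCoadd"
    inputs = [("split_maps", NpzFile), ("source_model", NpzFile)]
    outputs = [("coadded_map", NpzFile)]

    def run(self):
        splits = np.load(self.get_input("split_maps"))["maps"]   # (nsplit, npix)
        model = np.load(self.get_input("source_model"))["maps"]  # (nsplit, npix)
        cleaned = splits - model       # step 1: source subtraction, in memory
        coadd = cleaned.mean(axis=0)   # step 2: simple coadd, in memory
        # Only the final coadd is written; the cleaned splits never touch disk.
        np.savez(self.get_output("coadded_map"), maps=coadd)
```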
In any case, as I said above, I didn't intend for this to be imposed on anyone else, but if other groups find it useful, I'm happy to move this to a more generically-named repo and to separate out the BB-specific parts. Let me know @msyriac !
Thanks, so stages communicate only through disk. This can be inconvenient when working with high-resolution maps. I'll try to give a concrete example, both to be sure I understand and because it could be relevant to @msyriac (so that I can pretend that it is meaningful to have this discussion here).
Suppose that I want to run my pipeline over many MCs. The signal is always the same but the noise changes. Signal and noise are two inputs of the analysis stage. Can I avoid loading the signal from disk every time I run the analysis stage? (e.g. by having a signal-loading stage that pushes the input directly to the analysis stage without touching disk)
I think the existing structure here is useful and solves many problems in pipeline development, so it's worth discussing its potential limitations -- maybe we can extend it as necessary.
As a concrete example, I'll take part of the [component separation + lensing] pipeline. Here's a sketch for reference:
Let's focus on this part of the pipeline:
This part is typically run in a loop over a large number of sims. The starred components, when used in a production run, are very resource-intensive, so it is useful to have them output to disk so that individual parts can be re-run as necessary (question for @damonge: is re-running intermediate stages allowed by BBPipe?). The number of on-disk products also scales as Nsims and not as Nsims * N_tubes, which makes saving them to disk feasible. (This is why "sims" and "src_sub" should not save to disk.)
So, given that BBPipe currently does not allow stages that don't write to disk, we notice two things:
I don't think this is necessarily a showstopper. For example, I could have a few different pipeline stage definitions that I use depending on whether this is a test run or a production run (e.g. stages ILCCoadd and OptFilter vs a single IsotropicCoaddFilter stage). I think Davide's situation could possibly also be addressed by having different stage definitions for test and production runs.
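One possible way to make the two variants interchangeable (sketch only, same assumed stage interface and hypothetical names as above): give the test-run stage the same output tag that the last production stage would write, so downstream stages don't care which variant ran, and swapping them is just a matter of listing different stages when the pipeline is assembled.

```python
# Hypothetical sketch: a cheap test-run stand-in for the production
# ILCCoadd + OptFilter pair. It writes the same output tag ("filtered_map"),
# so the rest of the pipeline is unchanged when the variants are swapped.
import numpy as np
from bbpipe import PipelineStage   # assumed import path
from bbpipe.types import NpzFile   # assumed type class

class IsotropicCoaddFilter(PipelineStage):
    name = "IsotropicCoaddFilter"
    inputs = [("freq_maps", NpzFile)]
    outputs = [("filtered_map", NpzFile)]  # same tag the production filter stage would write

    def run(self):
        maps = np.load(self.get_input("freq_maps"))["maps"]  # (nfreq, npix)
        coadd = maps.mean(axis=0)  # trivial coadd in place of an ILC + optimal filter
        np.savez(self.get_output("filtered_map"), maps=coadd)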
Hi @msyriac (and sorry again for the delay)
To answer some of the questions:

a) No, BBPipe doesn't have stages that can be rerun a number of times as part of the same pipeline run. This is not a bad idea (since many pipelines will involve running things on a bunch of sims). The current way in which I deal with this in BB is to have for-loops inside each stage running over the simulations. Each stage also outputs the results from each simulation (either everything into a single file, or into separate files that are then listed in an output txt file). This is arguably a bit clunky, so I'd be happy to include an "n_iterations" parameter on each stage that reruns it where needed. In your example, you'd need to create separate stages that generate any one-off resource-intensive data products just once, and then pack the things that need to loop over simulations into other stages with an "n_iterations" parameter or something like that. Does this make sense? There's a chance I'm misinterpreting what you wrote completely. If what you meant is "do pipeline stages get skipped if their output data already exists when you rerun the pipeline?", the answer is yes.

b) Regarding non-disk-writing stages: I think this could be possible to implement (I'd need to think about the details a bit). The only straightforward way I see of doing so is to have "fictitious" independent stages that run one after another as part of the same MPI process (i.e. avoiding the need for them to communicate with each other to see when one has finished so the other can start). I don't like that idea much, though.
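For reference, a rough sketch of the "for-loop inside a stage" pattern described in (a), which also covers @dpole's example of loading the signal once and looping over noise realizations. Same caveats as before: the stage interface, file layout and the `n_sims`/`noise_dir` options are assumptions for illustration, not an existing BBPipe parameter:

```python
# Hypothetical sketch: one stage loops over all simulations internally.
# The signal map is read once; only the noise changes per realization.
# Per-sim results are collected into a single output file.
# Interface and option names are assumed, not real BBPipe code.
import numpy as np
from bbpipe import PipelineStage   # assumed import path
from bbpipe.types import NpzFile   # assumed type class

class AnalyzeSims(PipelineStage):
    name = "AnalyzeSims"
    inputs = [("signal_map", NpzFile)]
    outputs = [("sim_results", NpzFile)]
    config_options = {"n_sims": 100, "noise_dir": "./noise_sims"}  # illustrative

    def run(self):
        signal = np.load(self.get_input("signal_map"))["map"]  # loaded once
        results = []
        for i in range(self.config["n_sims"]):
            noise = np.load(f"{self.config['noise_dir']}/noise_{i:04d}.npz")["map"]
            total = signal + noise          # per-sim map, never written to disk
            results.append(self.analyze(total))
        np.savez(self.get_output("sim_results"), stats=np.array(results))

    def analyze(self, m):
        # Placeholder analysis: map variance as a stand-in statistic.
        return np.var(m)
```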
If the answers are 1. no and 2. yes, then (a) I'd like to start using this for lensing and maybe encourage @jcolinhill and @dpole to also think about using it for LAT component separation, and (b) I have a feature suggestion that I can implement in a PR: have the logger outputs include git commit hashes and/or package versions for specified modules. (This is really important for reproducible pipeline runs.)

This would work as follows: add an option in the main config to specify a set of Python modules; for each of these modules, before the stages begin, the script checks whether the module is an installed package or a git repo. If the former, note and log the package version number. If the latter, the script requires those repos to have clean working directories and errors out if not. If all specified repos have clean working directories, then note and log the branch names and commit hashes.
The definition of "clean working directories" could include the more restrictive requirement of no untracked files.
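A rough sketch of what that check could look like, using only the standard library plus git subprocess calls; the function name and how it would hook into the logger are illustrative:

```python
# Hypothetical sketch of the proposed reproducibility check: for each named
# module, log the installed package version if it's a regular package, or
# require a clean git working tree (including no untracked files) and log
# the branch name and commit hash otherwise.
import importlib
import os
import subprocess
from importlib.metadata import version, PackageNotFoundError

def log_module_provenance(module_names):
    info = {}
    for name in module_names:
        mod = importlib.import_module(name)
        repo_dir = os.path.dirname(os.path.abspath(mod.__file__))
        # Is this module sitting inside a git checkout?
        in_git = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--is-inside-work-tree"],
            capture_output=True, text=True).stdout.strip() == "true"
        if not in_git:
            try:
                # Assumes the distribution name matches the module name.
                info[name] = {"version": version(name)}
            except PackageNotFoundError:
                info[name] = {"version": "unknown"}
            continue
        # "git status --porcelain" also lists untracked files, so this
        # enforces the stricter definition of a clean working directory.
        status = subprocess.run(
            ["git", "-C", repo_dir, "status", "--porcelain"],
            capture_output=True, text=True).stdout
        if status.strip():
            raise RuntimeError(f"{name}: git working directory is not clean")
        branch = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True).stdout.strip()
        commit = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "HEAD"],
            capture_output=True, text=True).stdout.strip()
        info[name] = {"branch": branch, "commit": commit}
    return info
```

The pipeline driver could call something like this once before the first stage runs and dump the returned dictionary into the run log.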
Let me know if this sounds like something good to add here, or if this is already an existing feature.