Open SpheMakh opened 4 years ago
First of all, as a thought experiment, I suggest we try to formulate this as a generic YML validator that can replace PyKwalify, so that we can use it for Caracal too.
Things like the io
attribute would be Stimela-specific extension attributes. Extension attributes aside, is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal? If yes, then we're making one tool for both those jobs.
Agreed, but let's just do this. It will simplify the integration of stimela and caracal
is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal?
In principle yes.
Second, a major feature I would like to see is include tags, to include YML files into YML files. There are some straightforward implementations discussed here, we could just reuse them or cut and paste: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
Use cases for includes:
avoiding duplication (which means avoiding errors!) For example, Stimela has a whole family of casa-derived cabs with mostly duplicate parameters. These could all be defined in one place (in the base casa cab), and included into the derived cabs.
likewise, avoiding duplication in Caracal schema files
structuring Caracal config files better. For example, I have a fairly elaborate inspect worker config which I've customized (differently) for MeerKAT and uGMRT data. I continue to tweak it, but changes are increasingly difficult to propagate consistently. If I could just include it into my Caracal configs instead of cut-and-pasting it, this problem goes away.
perhaps more fancily: a Caracal schema could choose to import a (section of) a Stimela parameter file, to have a 1-to-1 mapping of parameters. This would mean tweaking the !include
implementation to specify a (sub)section to be included, but why not?
I volunteer to get a start on this, since I'm the one who seems to be accumulating the requirements...
Here is a first draft. The include tags don't really work for what we want, but we can just have an include section in the definition
schema: 2.0.0
task: casa_gaincal
base: stimela/casa:1.6.9
binary: giancal
description: Determine temporal gains from calibrator observations
prefix: null
include_inputs:
- ../misc/casagain.yml # this file contains parameters that are common between gain calibration tasks
include_outputs:
- ../misc/casagain.yml
inputs:
gaintype:
info: Type of gain solution
type: str
default: G
choices:
- G
- T
- GSPLINE
- K
- KCROSS
- XY+QU
- XYf+QU
splinetime:
info: Spline timescale(sec); All spw's are first averaged.
type: float
default: 3600.0
npointaver:
info: The phase-unwrapping algorithm
type: int
default: 3
phasewrap:
info: Wrap the phase for jumps greater than this value (degrees)
type: float
default: 180.0
smodel:
info: Point source Stokes parameters for source model.
type:
- List[str]
- List[float]
save_result:
info: Name of pickle file to save the task result into
type: str
outputs:
pickled_gains:
name: "{inputs.save_result.value}"
location: "{outdir}"
required: false
OK, so you're suggesting that any include_XYZ
is automatically loaded and added onto XYZ
, right?
But then why not do the same via an include_yml
keyword that can appear inside any mapping? E.g.
inputs:
include_yml:
- ../misc/casagain.yml
gaintype:
info: Type of gain solution
type: str
Or perhaps (I still want to be able to pull specific sections in):
inputs:
include_yml:
path: ../misc/casagain.yml
section: section_name
Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html
contains "{foo}" and "{bar}" at any point, these can be substituted in:
inputs:
include_yml:
path: ../misc/casagain.yml
section: section_name
vars:
foo: 1
bar: 2
OK, so you're suggesting that any include_XYZ is automatically loaded and added onto XYZ, right?
Yes
Or perhaps (I still want to be able to pull specific sections in):
inputs:
include_yml:
path: ../misc/casagain.yml
section: section_name
This is more versatile, especially if we want to use the same specification for caracal.
Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html contains "{foo}" and "{bar}" at any point, these can be substituted in:
Agreed.
I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir
via "{}"
I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir via "{}"
Yeah, I only saw that after I posted. That's a very useful feature (see e.g. CubiCal's default --j-save-to settings).
BTW, if you put yml
directly after your opening triple quote, you get syntax highlighting (see e.g. I edited your post above).
Also better to specify positional parameters in the YAML
inputs:
gaintype:
positional: false # <----
info: Type of gain solution
type: str
@paoloserra maybe you want to have some input here as we are thinking of also replacing pykwalify in caracal with this specification.
@SpheMakh I've had a look but I'm not sure what exactly (if anything) we would need to change in the caracal schema files.
Well the schema should be backwards-compatible for the most part, but it should allow for more options and flexibility. What sort of limitations were we running into with the caracal schemas?
Personally, I think the include
capability will be very useful.
Off the top of my head I can think of permitting a true None
value, but that's minor.
So include
is a way to give common settings to, e.g., multiple runs of the flag
worker within a single caracal run?
I'm not aware of significant limitations. We had to use example
to set the default values, which was a little odd, but that's all.
Something that often takes too much time when making a new worker is writing the parameters, which are already defined in stimela. With this, we can just use the include
feature and give the appropriate cab file. The changes to caracal would be minor but would make life easier moving forward.
I'm probably missing something, but CARACal has its own parameter names and default values, which do not always coincide with those in Stimela.
And anyway, very few parameters are actually required in CARACal, most workers can be added to a pipeline run adding very few parameters to the CARACal config file. This is shown by https://github.com/caracal-pipeline/caracal/blob/master/caracal/sample_configurations/minimalConfig.yml .
So I think we need to be careful about transferring Stimela par names / default values to CARACal.
I think for future packages (and perhaps Caracal R2.0), it would make a lot of sense to lean on include heavily. Some of the current implementations I find annoying, confusing and error-prone. For example, it would be a lot "cleaner" if wsclean settings were their own subsection in selfcal, something like:
selfcal:
imager: wsclean
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
robust: -1 # overriding whatever was pulled from stimela
This way Caracal could be free to have different and perhaps more sensible defaults (and the {}-subst mechanism would allow for various games with the remapping of parameter names), but at the same time there would be a consistent 1:1:1 mapping between wsclean parameter names, Stimela parameter names, and settings in the appropriate subsection, which makes life easier for everyone, no?
(CubiCal and the coming QuartiCal is another package-of-100-parameters that could benefit from a consistent naming mechanism...)
Another thing that include
would be great for is providing "standard libraries". For example:
flag:
include_yml:
path: caracal/meerkat-flagging.yml
section: aggresive_4k_flagging
inspect:
include_yml:
path: caracal/meerkat-inspect.yml
section: detailed_1gc_plots
Not saying things should literally be implemented like this -- just throwing food for thought out there.
I don't know about parameter name matching. Stimela tends to stick to the native parameter names. CARACal does not need to. In fact, we've changed most of the parameter names in order to have some internal CARACal consistency. So I'd say a Stimela-Caracal match is never going to happen.
I fully agree that the CARACal selfcal
parameters need a lot of cleaning up -- things need to be more modular.
Overall there's plenty to improve in CARACal, and it's good to think of this for a 2.0 release. But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.
But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.
Well, it's not about following Stimela naming. Rather, for complicated packages such as WSClean, DDF and CubiCal, I'd like to (ideally) see a system where Stimela names are consistent with the native package parameter names, and there's a way to import this schema into Caracal "as is". Ideally, users who know WSClean should be able to piss out Caracal configs at will without having to look up what we decided to rename each parameter to. If you're going to give the user rope, make it familiar rope.
This does not have to preclude a simple and self-consistent subset of parameters to buttress the Caracal newbie.
I also lean towards keeping the names the same as the packages, but this will be optional. We can have an optional overrides
feature
selfcal:
imager: wsclean
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
pixel:
info: Image size in pixels
overrides: size # <--------------
.
.
.
Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like cellsize npixel niter
when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.
But that divergence can be accomplished via override and {}-substitutions. E.g.:
selfcal:
cellsize:
...
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
scale: # "scale" is a wsclean-specific paramater
default: {selfcal.cellsize} # ...which we set from a Caracal parameter by default (but the end user can still override)
Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like
cellsize npixel niter
when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.
Yes, that was one of the reasons.
Actually, I don't think we need to reinvent the wheel here. Both includes and substitutions (of a sort) are already standard YaML features. See merge and anchors, https://yaml.org/type/merge.html.
The somewhat annoying thing is that anchors have to be explicitly set with "&" at point of origin. And it knows nothing about including files -- the YaML code in the file has to be loaded in the stream before. So the example above would have to look something like
selfcal:
cellsize: &selfcal_cellsize
...
img_wsclean:
<<: *stimela__cabs__wsclean__parameters
scale:
default: *selfcal_cellsize
(NB, this is cartoon code, it can't actually work like that, but I don't want to cloud the picture with details...)
There may be a way to automatically set anchor names if we use ruamel.yaml
, as it has a set_yaml_anchor
feature. The loading code would then look something like:
wsclean.yml
would end up under stimela: cabs: wsclean
.stimela__cabs__wsclean__parameters
.In principle, this is a generic pattern to support a layer of config files Y that can reference sections from a layer of config files X via automatically-generated anchor names. You can keep on adding layers, the only limitation is that each layer can reference auto-generated anchors from the previous layers, but not from within itself.
This sounds complicated, but I think I can make a simple proof-of-concept on am experimental Stimela branch.
I think Stimela itself needs this: there's currently a lot of repetition in the cabs parameters (for example, the entire casa_xxx family has many common parameters) which could be eliminated by defining common stuff in config files under base
, and pulling this into config files under cabs
using the merge mechanism. In terms of my previous comment, the base configs would be layer X (loaded into a stimela__base
namespace), and the cab configs would be layer Y (stimela__cabs
). And Caracal configs could then layer on top of that using the same mechanism.
OK, I got tired of fighting Caracal and data and flags, so I spent a day or so hacking on a couple of proofs of concept. Checked it into the configuratt
branch.
This uses anchors and aliases, as discussed above, plus some extra code to automatically set anchors, and merge YML configs. See here, and run as python stimela/configuratt/demo.py
.
I've made example .yml
files for the casa base image and for the applycal cab to illustrate.
Pros: the syntax is nice and clean, and is a standard YML feature, e.g.:
inputs:
<<: *stimela-base-casa-mssel_inputs
docallib:
info: "Use callib or traditional cal apply parameters"
type: bool
default: false
Cons: the functionality is limited, and it all feels a bit klunky. Also, applycal.yml
can't be read on its own by a standard YML parser because it contains these undefined aliases which my code generates auto-magically. It can only be read together with casa.yml
by the magic code in demo.py
.
This is a neat new package, and provides almost all the features we need (plus it's evolving quickly, so will probably have the rest soon). It's a lot more powerful than that: you can specify a schema programmatically using the standard Python typing
module, and the resulting schema can be a lot richer than what pykwalify offers. See https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py, it's practically self-documenting.
The >>
feature is not quite in it yet, but it has ${a.b.c}
interpolation, and also direct command-line support, and lots of other good stuff. See applycl.yaml for a taste (note that the demo files for approach 2 are called yaml
as opposed to yml
).
Pros: powerful and elegant schemas, explicitly designed for the situation where a config is spread across many files and we want to merge them all together, built-in interpolation (${section.variable}
) is very useful. Seems to be developing quickly. Explicit command-line support, we can lose all that horrible parsing code.
Cons: some schema errors have been utterly incomprehensible (if you thought pykwalify was bad...) It is also an option to use omegaconf without the schema mechanism (essentially, it's then a fancy YML reader providing variable interpolation), and then on top of that use something like pykwalify to process and validate the config afterwards.
The >> feature is not quite in it yet,
After a discussion with the developer, I realized this is something relatively simple to bolt on externally, so I checked in a modified demo. This works and looks quite nice now:
inputs:
_use:
- base.casa.params.mssel_inputs
docallib:
info: "Use callib or traditional cal apply parameters"
type: bool
default: false
I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:
crosscal:
_use: lib.meerkat.crosscal.4k
secondary:
order: KGIAKF
inspect:
_use: lib.meerkat.plots.cal.phaseballs, lib.meerkat.plots.cal.spectra, lib.meerkat.plots.cal.detailed
shadems:
plots:
my_extra_plots:
- some_extra_plots
P.S. This needs omegaconf master to run, as it uses some features from the upcoming 2.1 release.
@o-smirnov You mention pykwalify vs typing. I was wondering if overloading a parameter in a schema, by which I mean allowing it to be multiple types, e.g.:
all allowed, but not a float, a list of floats, a list of n+1 or n-1 strings etc.
Good question, but it's turning into a Caracal discussion, so let's discuss in in https://github.com/caracal-pipeline/caracal/discussions/1303 rather.
I am not particularly concerned where a validation takes place, so this is good for me, but in principle this could (should?) happen as early as possible, i.e. Stimela (where the question turned up for me).
I go for approach 2. OmegaConf is exactly what we need
I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:
Yes, this is the direction we should move. For example, the casa base file I use is below. Let me incorporate the mergeConf shindig to my branch
vis:
type: Directory
info: Name of input visibility file
default: null
required: true
positional: false
prefix: null
field:
info: Select field using field id(s) or field name(s)
type: str
default: null
spw:
info: Select spectral window/channels
type : str
default: null
selectdata:
info: Other data selection parameters
type: bool
default: true
timerange:
info: Select data based on time range
type: str
default: null
uvrange:
info": Select data within uvrange (default units meters)
dtype: str
default: null
antenna:
info: Select data based on antenna/baseline
type: str
default: null
scan:
info: Scan number range
type: str
default: null
observation:
info: Select by observation ID(s)
type: str
default: null
msselect:
info: Optional complex data selection (ignore for now)
type: str
default: null
That's almost word-for-word what I wrote for the omegaconf demo: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/cargo/base/casa/casa.yaml
I go for approach 2. OmegaConf is exactly what we need
OK then let's start a branch and get moving. Setting up the infrastructure is very little work, I think most of it is already in my demo. Porting all the parameters.json
files will take some effort, but they'll be so nice and compact when it's done, it's bound to be aesthetically rewarding.
Implementing #692 will also be pretty straightforward then.
Let's continue in your configuratt branch
OK cool, well I moved all the generic config merging infrastructure into the configuratt module, the init now is as simple as it can be: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py#L109.
If you check the schemas and start working on the image and cab ymls, I volunteer to work on #692.
Also, #598 is naturally merged into this...
This is probably better served by a discussion thread: https://github.com/ratt-ru/Stimela/discussions/696
And we should include https://github.com/ratt-ru/Stimela/issues/598 in these deliberations.
The cab specification has a few obvious flaws. These ones are particularly bad:
parameters
list was specified as: