cab specification 2.0 - Githubissues

SpheMakh commented 4 years ago

The cab specification has a few obvious flaws. These ones are particularly bad:

does not distinguish between files and directories
does not allow a pattern specification (pinging @bennahugo ;))

It would be more readable if the parameters list was specified as:

param1:
type: foo
desc: blah blah 
default: bar
io: input | output | mixed
param2:
.
.
.

o-smirnov commented 4 years ago

First of all, as a thought experiment, I suggest we try to formulate this as a generic YML validator that can replace PyKwalify, so that we can use it for Caracal too.

Things like the io attribute would be Stimela-specific extension attributes. Extension attributes aside, is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal? If yes, then we're making one tool for both those jobs.

SpheMakh commented 4 years ago

Agreed, but let's just do this. It will simplify the integration of stimela and caracal

is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal?

In principle yes.

o-smirnov commented 4 years ago

Second, a major feature I would like to see is include tags, to include YML files into YML files. There are some straightforward implementations discussed here, we could just reuse them or cut and paste: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another

Use cases for includes:

avoiding duplication (which means avoiding errors!) For example, Stimela has a whole family of casa-derived cabs with mostly duplicate parameters. These could all be defined in one place (in the base casa cab), and included into the derived cabs.
likewise, avoiding duplication in Caracal schema files
structuring Caracal config files better. For example, I have a fairly elaborate inspect worker config which I've customized (differently) for MeerKAT and uGMRT data. I continue to tweak it, but changes are increasingly difficult to propagate consistently. If I could just include it into my Caracal configs instead of cut-and-pasting it, this problem goes away.
perhaps more fancily: a Caracal schema could choose to import a (section of) a Stimela parameter file, to have a 1-to-1 mapping of parameters. This would mean tweaking the !include implementation to specify a (sub)section to be included, but why not?

o-smirnov commented 4 years ago

I volunteer to get a start on this, since I'm the one who seems to be accumulating the requirements...

SpheMakh commented 3 years ago

Here is a first draft. The include tags don't really work for what we want, but we can just have an include section in the definition

schema: 2.0.0
task: casa_gaincal
base: stimela/casa:1.6.9
binary: giancal
description: Determine temporal gains from calibrator observations
prefix: null

include_inputs: 
  - ../misc/casagain.yml # this file contains parameters that are common between gain calibration tasks
include_outputs:
  - ../misc/casagain.yml

inputs:
  gaintype:
    info: Type of gain solution
    type: str 
    default: G
    choices:
      - G 
      - T 
      - GSPLINE
      - K 
      - KCROSS
      - XY+QU
      - XYf+QU
  splinetime:
    info: Spline timescale(sec); All spw's are first averaged.
    type: float
    default: 3600.0
  npointaver:
    info: The phase-unwrapping algorithm
    type: int 
    default: 3
  phasewrap:
    info: Wrap the phase for jumps greater than this value (degrees)
    type: float
    default: 180.0
  smodel:
    info: Point source Stokes parameters for source model.
    type:
      - List[str]
      - List[float]
  save_result:
    info: Name of pickle file to save the task result into
    type: str 

outputs:
  pickled_gains:
    name: "{inputs.save_result.value}"
    location: "{outdir}"
    required: false

o-smirnov commented 3 years ago

OK, so you're suggesting that any include_XYZ is automatically loaded and added onto XYZ, right?

But then why not do the same via an include_yml keyword that can appear inside any mapping? E.g.

inputs:
  include_yml:
    - ../misc/casagain.yml
  gaintype:
    info: Type of gain solution
    type: str

Or perhaps (I still want to be able to pull specific sections in):

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name

Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html contains "{foo}" and "{bar}" at any point, these can be substituted in:

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name
    vars:
      foo: 1
      bar: 2

SpheMakh commented 3 years ago

OK, so you're suggesting that any include_XYZ is automatically loaded and added onto XYZ, right?

Yes

Or perhaps (I still want to be able to pull specific sections in):

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name

This is more versatile, especially if we want to use the same specification for caracal.

Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html contains "{foo}" and "{bar}" at any point, these can be substituted in:

Agreed.

I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir via "{}"

o-smirnov commented 3 years ago

I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir via "{}"

Yeah, I only saw that after I posted. That's a very useful feature (see e.g. CubiCal's default --j-save-to settings).

BTW, if you put yml directly after your opening triple quote, you get syntax highlighting (see e.g. I edited your post above).

SpheMakh commented 3 years ago

Also better to specify positional parameters in the YAML

inputs:
  gaintype:
    positional: false  # <----
    info: Type of gain solution
    type: str

SpheMakh commented 3 years ago

@paoloserra maybe you want to have some input here as we are thinking of also replacing pykwalify in caracal with this specification.

paoloserra commented 3 years ago

@SpheMakh I've had a look but I'm not sure what exactly (if anything) we would need to change in the caracal schema files.

o-smirnov commented 3 years ago

Well the schema should be backwards-compatible for the most part, but it should allow for more options and flexibility. What sort of limitations were we running into with the caracal schemas?

Personally, I think the include capability will be very useful.

Off the top of my head I can think of permitting a true None value, but that's minor.

paoloserra commented 3 years ago

So include is a way to give common settings to, e.g., multiple runs of the flag worker within a single caracal run?

I'm not aware of significant limitations. We had to use example to set the default values, which was a little odd, but that's all.

SpheMakh commented 3 years ago

Something that often takes too much time when making a new worker is writing the parameters, which are already defined in stimela. With this, we can just use the include feature and give the appropriate cab file. The changes to caracal would be minor but would make life easier moving forward.

paoloserra commented 3 years ago

I'm probably missing something, but CARACal has its own parameter names and default values, which do not always coincide with those in Stimela.

And anyway, very few parameters are actually required in CARACal, most workers can be added to a pipeline run adding very few parameters to the CARACal config file. This is shown by https://github.com/caracal-pipeline/caracal/blob/master/caracal/sample_configurations/minimalConfig.yml .

So I think we need to be careful about transferring Stimela par names / default values to CARACal.

o-smirnov commented 3 years ago

I think for future packages (and perhaps Caracal R2.0), it would make a lot of sense to lean on include heavily. Some of the current implementations I find annoying, confusing and error-prone. For example, it would be a lot "cleaner" if wsclean settings were their own subsection in selfcal, something like:

selfcal:
  imager: wsclean
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    robust: -1   # overriding whatever was pulled from stimela

This way Caracal could be free to have different and perhaps more sensible defaults (and the {}-subst mechanism would allow for various games with the remapping of parameter names), but at the same time there would be a consistent 1:1:1 mapping between wsclean parameter names, Stimela parameter names, and settings in the appropriate subsection, which makes life easier for everyone, no?

(CubiCal and the coming QuartiCal is another package-of-100-parameters that could benefit from a consistent naming mechanism...)

Another thing that include would be great for is providing "standard libraries". For example:

flag:
  include_yml:
    path: caracal/meerkat-flagging.yml
    section: aggresive_4k_flagging

inspect:
  include_yml:
    path: caracal/meerkat-inspect.yml
    section: detailed_1gc_plots

Not saying things should literally be implemented like this -- just throwing food for thought out there.

paoloserra commented 3 years ago

I don't know about parameter name matching. Stimela tends to stick to the native parameter names. CARACal does not need to. In fact, we've changed most of the parameter names in order to have some internal CARACal consistency. So I'd say a Stimela-Caracal match is never going to happen.

I fully agree that the CARACal selfcal parameters need a lot of cleaning up -- things need to be more modular.

Overall there's plenty to improve in CARACal, and it's good to think of this for a 2.0 release. But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.

o-smirnov commented 3 years ago

But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.

Well, it's not about following Stimela naming. Rather, for complicated packages such as WSClean, DDF and CubiCal, I'd like to (ideally) see a system where Stimela names are consistent with the native package parameter names, and there's a way to import this schema into Caracal "as is". Ideally, users who know WSClean should be able to piss out Caracal configs at will without having to look up what we decided to rename each parameter to. If you're going to give the user rope, make it familiar rope.

This does not have to preclude a simple and self-consistent subset of parameters to buttress the Caracal newbie.

SpheMakh commented 3 years ago

I also lean towards keeping the names the same as the packages, but this will be optional. We can have an optional overrides feature


selfcal:
  imager: wsclean
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    pixel:
      info: Image size in pixels
      overrides: size  # <--------------
      .
      .
      .

SpheMakh commented 3 years ago

Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like cellsize npixel niter when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.

o-smirnov commented 3 years ago

But that divergence can be accomplished via override and {}-substitutions. E.g.:


selfcal:
  cellsize:
     ...
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    scale:                                    # "scale" is a wsclean-specific paramater
      default: {selfcal.cellsize}     # ...which we set from a Caracal parameter by default (but the end user can still override)

paoloserra commented 3 years ago

Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like cellsize npixel niter when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.

Yes, that was one of the reasons.

o-smirnov commented 3 years ago

Actually, I don't think we need to reinvent the wheel here. Both includes and substitutions (of a sort) are already standard YaML features. See merge and anchors, https://yaml.org/type/merge.html.

The somewhat annoying thing is that anchors have to be explicitly set with "&" at point of origin. And it knows nothing about including files -- the YaML code in the file has to be loaded in the stream before. So the example above would have to look something like

selfcal: 
  cellsize: &selfcal_cellsize
     ...
  img_wsclean:
    <<: *stimela__cabs__wsclean__parameters
    scale:                                    
      default: *selfcal_cellsize

(NB, this is cartoon code, it can't actually work like that, but I don't want to cloud the picture with details...)

There may be a way to automatically set anchor names if we use ruamel.yaml, as it has a set_yaml_anchor feature. The loading code would then look something like:

Load all stimela yaml files into a single nested dist, e.g. the contents of wsclean.yml would end up under stimela: cabs: wsclean.
Run a loop to assign anchors based on section names, thus automatically creating anchors such as e.g. stimela__cabs__wsclean__parameters.
Dump the dict to a string yaml buffer.
Load config from buffer + Caracal yaml.

In principle, this is a generic pattern to support a layer of config files Y that can reference sections from a layer of config files X via automatically-generated anchor names. You can keep on adding layers, the only limitation is that each layer can reference auto-generated anchors from the previous layers, but not from within itself.

o-smirnov commented 3 years ago

This sounds complicated, but I think I can make a simple proof-of-concept on am experimental Stimela branch.

I think Stimela itself needs this: there's currently a lot of repetition in the cabs parameters (for example, the entire casa_xxx family has many common parameters) which could be eliminated by defining common stuff in config files under base, and pulling this into config files under cabs using the merge mechanism. In terms of my previous comment, the base configs would be layer X (loaded into a stimela__base namespace), and the cab configs would be layer Y (stimela__cabs). And Caracal configs could then layer on top of that using the same mechanism.

o-smirnov commented 3 years ago

OK, I got tired of fighting Caracal and data and flags, so I spent a day or so hacking on a couple of proofs of concept. Checked it into the configuratt branch.

Approach 1: anchors and aliases

This uses anchors and aliases, as discussed above, plus some extra code to automatically set anchors, and merge YML configs. See here, and run as python stimela/configuratt/demo.py.

I've made example .yml files for the casa base image and for the applycal cab to illustrate.

Pros: the syntax is nice and clean, and is a standard YML feature, e.g.:

inputs:
  <<: *stimela-base-casa-mssel_inputs

  docallib:
    info: "Use callib or traditional cal apply parameters"
    type: bool
    default: false

Cons: the functionality is limited, and it all feels a bit klunky. Also, applycal.yml can't be read on its own by a standard YML parser because it contains these undefined aliases which my code generates auto-magically. It can only be read together with casa.yml by the magic code in demo.py.

Approach 2: use OmegaConf

This is a neat new package, and provides almost all the features we need (plus it's evolving quickly, so will probably have the rest soon). It's a lot more powerful than that: you can specify a schema programmatically using the standard Python typing module, and the resulting schema can be a lot richer than what pykwalify offers. See https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py, it's practically self-documenting.

The >> feature is not quite in it yet, but it has ${a.b.c} interpolation, and also direct command-line support, and lots of other good stuff. See applycl.yaml for a taste (note that the demo files for approach 2 are called yaml as opposed to yml).

Pros: powerful and elegant schemas, explicitly designed for the situation where a config is spread across many files and we want to merge them all together, built-in interpolation (${section.variable}) is very useful. Seems to be developing quickly. Explicit command-line support, we can lose all that horrible parsing code.

Cons: some schema errors have been utterly incomprehensible (if you thought pykwalify was bad...) It is also an option to use omegaconf without the schema mechanism (essentially, it's then a fancy YML reader providing variable interpolation), and then on top of that use something like pykwalify to process and validate the config afterwards.

o-smirnov commented 3 years ago

The >> feature is not quite in it yet,

After a discussion with the developer, I realized this is something relatively simple to bolt on externally, so I checked in a modified demo. This works and looks quite nice now:

inputs:
  _use:
    - base.casa.params.mssel_inputs

  docallib:
    info: "Use callib or traditional cal apply parameters"
    type: bool
    default: false

I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:

crosscal:
  _use: lib.meerkat.crosscal.4k
  secondary:
    order: KGIAKF

inspect:
  _use: lib.meerkat.plots.cal.phaseballs, lib.meerkat.plots.cal.spectra, lib.meerkat.plots.cal.detailed

  shadems:
    plots: 
      my_extra_plots: 
        - some_extra_plots

P.S. This needs omegaconf master to run, as it uses some features from the upcoming 2.1 release.

gigjozsa commented 3 years ago

@o-smirnov You mention pykwalify vs typing. I was wondering if overloading a parameter in a schema, by which I mean allowing it to be multiple types, e.g.:

None
int
list of ints
list of n strings
list of list of floats

all allowed, but not a float, a list of floats, a list of n+1 or n-1 strings etc.

o-smirnov commented 3 years ago

Good question, but it's turning into a Caracal discussion, so let's discuss in in https://github.com/caracal-pipeline/caracal/discussions/1303 rather.

gigjozsa commented 3 years ago

I am not particularly concerned where a validation takes place, so this is good for me, but in principle this could (should?) happen as early as possible, i.e. Stimela (where the question turned up for me).

SpheMakh commented 3 years ago

I go for approach 2. OmegaConf is exactly what we need

SpheMakh commented 3 years ago

I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:

Yes, this is the direction we should move. For example, the casa base file I use is below. Let me incorporate the mergeConf shindig to my branch

  vis:
    type: Directory
    info: Name of input visibility file
    default: null
    required: true
    positional: false
    prefix: null
  field:
    info: Select field using field id(s) or field name(s)
    type: str
    default: null 
  spw: 
    info: Select spectral window/channels
    type : str
    default: null
  selectdata:
    info: Other data selection parameters
    type: bool 
    default: true
  timerange:
    info: Select data based on time range
    type: str
    default: null
  uvrange:
    info": Select data within uvrange (default units meters)
    dtype: str 
    default: null
  antenna:
    info: Select data based on antenna/baseline
    type: str
    default: null
  scan:
    info: Scan number range
    type: str 
    default: null
  observation:
    info: Select by observation ID(s)
    type: str
    default: null
  msselect:
    info: Optional complex data selection (ignore for now)
    type: str
    default: null

o-smirnov commented 3 years ago

That's almost word-for-word what I wrote for the omegaconf demo: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/cargo/base/casa/casa.yaml

o-smirnov commented 3 years ago

I go for approach 2. OmegaConf is exactly what we need

OK then let's start a branch and get moving. Setting up the infrastructure is very little work, I think most of it is already in my demo. Porting all the parameters.json files will take some effort, but they'll be so nice and compact when it's done, it's bound to be aesthetically rewarding.

Implementing #692 will also be pretty straightforward then.

SpheMakh commented 3 years ago

Let's continue in your configuratt branch

o-smirnov commented 3 years ago

OK cool, well I moved all the generic config merging infrastructure into the configuratt module, the init now is as simple as it can be: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py#L109.

If you check the schemas and start working on the image and cab ymls, I volunteer to work on #692.

Also, #598 is naturally merged into this...

SpheMakh commented 3 years ago

deal

o-smirnov commented 3 years ago

This is probably better served by a discussion thread: https://github.com/ratt-ru/Stimela/discussions/696

o-smirnov commented 3 years ago

And we should include https://github.com/ratt-ru/Stimela/issues/598 in these deliberations.

ratt-ru / Stimela-classic

cab specification 2.0 #637

Approach 1: anchors and aliases

Approach 2: use OmegaConf