hpc4cmb / toast

Time Ordered Astrophysics Scalable Tools

Parameterized construction of operators #340

Closed tskisner closed 3 years ago

tskisner commented 4 years ago

This issue continues the discussion from the older issue #202. We want an easy way of constructing operators directly from a dictionary of parameters. This has several parts:

  1. Some code external to the operators can parse options from a variety of sources. For example, a TOML config file could set some defaults that are then overridden by command line arguments.

  2. The operators themselves should have a method to return their accepted parameters (with the type / help string of each) and should internally parse and validate these options.

With these changes, the code that defines the options lives with the operator code itself. Much of the pipeline_tools subpackage would migrate into each operator. There would still be helper functions in pipeline_tools to do things like loading / dumping config files, merging config files with argparse options, building a list of accepted options given the set of operator classes in use, etc.
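As a rough sketch of part 1 (hypothetical option names, not a final toast API), a helper could layer TOML defaults under command line overrides:

import argparse

import toml

def parse_options(config_file, argv=None):
    # Values from the TOML file become argparse defaults, so anything
    # given on the command line overrides the file.
    defaults = toml.load(config_file)
    parser = argparse.ArgumentParser()
    parser.add_argument("--nside", type=int,
                        default=defaults.get("nside", 512))
    parser.add_argument("--baseline-length", type=float,
                        default=defaults.get("baseline-length", 10000.0))
    return parser.parse_args(argv)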

zonca commented 4 years ago

I have something similar in mapsims; you could review that approach for reference. An example parameter file is here: https://github.com/simonsobs/mapsims/blob/master/mapsims/data/example_config_v0.2.toml

The approach is very simple: each section of the config file, for example cmb, has a class attribute, in this case mapsims.SOPrecomputedCMB. That class is instantiated, and all the other lines of the cmb section are passed as keyword arguments to the class constructor.

See the code https://github.com/simonsobs/mapsims/blob/master/mapsims/runner.py#L110
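A stripped-down sketch of that dispatch pattern (not the actual mapsims code) would be:

import importlib

import toml

def from_config(path):
    # Each config section names a class; the remaining keys become
    # keyword arguments to that class's constructor.
    components = {}
    for name, section in toml.load(path).items():
        if not isinstance(section, dict) or "class" not in section:
            continue
        module_name, class_name = section["class"].rsplit(".", 1)
        cls = getattr(importlib.import_module(module_name), class_name)
        kwargs = {k: v for k, v in section.items() if k != "class"}
        components[name] = cls(**kwargs)
    return components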

I implemented command line overriding manually, just for the most useful parameters, i.e. NSIDE and channels. I prefer command line overriding to be limited, just for splitting jobs, with everything else in config files. I think a general mechanism would be complicated and fragile to implement, and not worth the risk.

tskisner commented 4 years ago

Thanks @zonca, I like your approach. I have been thinking about this more lately. Combining this functionality with the data model changes in #334 and the operator changes in #339 will get us closer to the usability goal I have in mind. Basically I want interactive use to look like this:

import toast
import toml

from myexperiment import DataLoader

# Data load / dump is an operator now. Input data can be None
loader = DataLoader(options...)
data = loader.exec(data=None)

print(data)
>>> [ <toast.Observation name=blah dets=[d1, d2, d3]>, <toast.Observation name=foo ...>...]

# The default "signal"
print(data[0].signal("d1"))
>>> [5.6, 6.7, 7.8, ...]

# Some other timestream flavor
print(data[0]["filtered"]["d1"])
>>> [-3.0, -4.0, -5.0, ...]

# NOTE:  Ignoring things like pointing expansion, filtering, etc here...

from toast import OpMapMaker as mapmaker

# What are the default options for this operator?  This is a class method, 
# not an instance method.
opts = mapmaker.defaults()
print(opts)
>>> {"precond-width": 100, "baseline-length": 10000.0, "nside": 512, "binmap": False, ...}

# Change some options from the defaults
opts["nside"] = 1024
opts["binmap"] = True

# Single channel maps from alternating observations
mapper = mapmaker(**opts)
for det in ["d1", "d2", "d3"]:
    mapper.update(data=data, dets=[det], obs=["blah1", "blah3", "blah5"...])
mapper.finalize()

# Existing method is a shortcut for running all detectors, all observations
mapper = mapmaker(**opts)
mapper.exec(data)

# Now I want to manually get config from my friend instead:
config = toml.load(config_file)
mapmaker_config = config["awesome-map-config"]
mapper = mapmaker(**mapmaker_config)
mapper.exec(data)

For non-interactive use in a pipeline script, I really like your from_config() concept. Basically it returns a dictionary of instantiated classes. So something like:

import toast
import toml

# Load config file(s)
config = dict()
for conf in config_files:
    config.update(toml.load(conf))

# Instantiate operators
ops = toast.from_config(config)

# Process detector sets one at a time
loader = ops["loader"]
pointing = ops["pointing"]
mapper = ops["mapper"]
for dets in [["d1", "d2"], ["d3", "d4"], ...]:
    # Only read these detectors into memory
    data = loader.update(data=None, dets=dets)
    # Expand pointing
    pointing.update(data=data, dets=dets)
    # Map them
    mapper.update(data=data, dets=dets)
mapper.finalize()
pointing.finalize()
loader.finalize()

Obviously this is just brainstorming, but this is the kind of simplicity we need to get to. And we need to have the "default" set of config values (and perhaps their allowed types and help strings) living with the class they are associated with, rather than in the external pipeline_tools package.
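For example, the defaults (purely illustrative, not a final design) could live on the class like this:

class OpMapMaker:
    # Hypothetical storage: default value, type, and help string for
    # each accepted parameter, kept with the operator class itself.
    _params = {
        "nside": (512, int, "Healpix NSIDE of the output map"),
        "binmap": (False, bool, "Also produce a simple binned map"),
    }

    @classmethod
    def defaults(cls):
        return {name: spec[0] for name, spec in cls._params.items()}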

zonca commented 4 years ago

I think the names of methods should be verbs, so get_signal instead of signal; the same goes for defaults.

Then, can we do without finalize and assume classes are immutable once initialized?

Also, the update method is unclear: what does it update? It seems to me it should be exec or run, maybe?

tskisner commented 4 years ago

Ok, I like your idea of using verbs for the names. The "update" name was just me trying to use another word since "exec" was taken :-)

Basically the model I had was:

  1. Class is constructed with options that are fixed for the life of the class instance.

  2. Class "does something" (update()) with a subset of the data, cutting on observation and detectors. Actually we could move the "observation cut" to a separate function, since it is cheap to make a new data object with just some of the observations. The update() (or whatever name) might accumulate some global quantities using that subset of detectors. For example, a noise-weighted map, inverse noise covariance, timestream statistics, etc.

  3. The finalize() method was intended to indicate that no other calls will be made to update(), so that the class could safely do things with the global quantities.

  4. The exec() method would stay backwards compatible, and call update(dets=all) and then finalize().

Since not all operators need finalize() functionality, we could just skip it, and certain operators could implement a custom function (with any name) that would "do something" with the global quantities at the end.

We could also probably modify the exec() API to be backwards compatible (use all dets by default).
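As a concrete sketch of this model (a hypothetical base class, assuming the update() / finalize() split above):

class Operator:

    def __init__(self, **opts):
        # Options are fixed for the life of the instance.
        self._opts = opts

    def update(self, data=None, dets=None):
        # Process one subset of detectors, accumulating global
        # quantities (noise-weighted map, covariance, etc.) across calls.
        raise NotImplementedError

    def finalize(self):
        # No further calls to update() will be made; act on the
        # accumulated global quantities.
        pass

    def exec(self, data):
        # Backwards-compatible shortcut: process all detectors in one
        # call, then finalize.
        self.update(data=data, dets=None)
        self.finalize()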

Ok, I think I am moving towards your way of thinking on this. Do you have any opinion about using the exec() method as a dedicated function versus defining __call__() for the operators, so that they are just used as functions?

zonca commented 4 years ago

exec is fine; using __call__ is often confusing. I would call the method process instead of update, since it is processing a bunch of data for some detectors. Also, all editors have automatic completion nowadays, so I think we should use long names: detectors and not dets, which reminds me of determinants for some reason.

tskisner commented 3 years ago

All of these interfaces are now in place. Operators are configured with traits from the traitlets package and can be automatically instantiated from command line options and / or config files in TOML and JSON format.
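For illustration (these trait names are made up; see the toast source for the real definitions), a traitlets-configured operator looks roughly like:

from traitlets import Bool, Float, HasTraits, Int

class MapMaker(HasTraits):
    # Each trait carries a type, a default, and a help string, so the
    # accepted parameters live with the operator class itself.
    nside = Int(512, help="Healpix NSIDE of the output map")
    baseline_length = Float(10000.0, help="Destriping baseline length")
    binmap = Bool(False, help="Also produce a simple binned map")

# Trait values can come from parsed config files or command line options:
mapper = MapMaker(nside=1024, binmap=True)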