feat(pipeline): check cluster time in yaml config

rikvl commented 1 year ago

I ran into a small issue with the cluster configuration that may go under the radar:

I tried to specify a job time of 12 hours in a configuration file in the following way (without quotation marks):

cluster:
    time:      12:00:00

PyYAML accepts this without complaints, but parses the value as a sexagimal (base 60) number, giving the following line in the job script:

#SBATCH --time=43200

Slurm then interprets this bare number as minutes, so it results in a job with a time request of 43200 minutes = 720 hours = 30 days, i.e. 60 times more time than I was expecting. In this particular case Cedar did not accept the job, because the time requested is longer than the maximum allowed (28 days). However, if the raw value is 11:12:00 or less Cedar won't complain, and the erroneous time value will lead to jobs inadvertently requesting much more time than needed and maybe having difficulties getting scheduled.

The solution is to make sure such time values are enclosed in quotation marks.

My goal in this PR was to let the pipeline give an error (it could perhaps be a warning instead) when it encounters time values consisting of just digits and colons. Because the check has to be made on the raw configuration string yaml_doc (not on the yaml_params dictionary that is already parsed by PyYAML) I used some ugly regex matching to do the check. Maybe there's a better way.

Either way I do think it's would be good to add some sort of check for such time values. An alternative would be to try to simply educate everyone to put quotes around the time value. A quick check of the example configuration files in ch_pipeline shows there's one with this mistake: https://github.com/chime-experiment/ch_pipeline/blob/master/examples/noise_calibration.yaml

jrs65 commented 1 year ago

@rikvl I definitely see the problem here, but I'm not sure if this is the way we should go about solving it. As far as I can tell, this conversion was deemed to be a mistake in the YAML spec and removed from YAML 1.2. Unfortunately PyYAML only supports YAML 1.1 and so will do this conversion.

I'm a little concerned that your check is just going to be picking up on other things, as for instance it will scan the whole document for a time: tag. There are maybe a few better ways to address this.

Switch to a YAML 1.2 implementing parser such as ruamel, or poyo
Switch a deliberately stricter parser, e.g. StrictYAML.

Just switching to ruamel is probably the easiest. It will also fix the so-called Norway problem too, although that might break some of our older configs.

Anyway, I'm definitely interested to hear your thoughts.

rikvl commented 1 year ago

Right, I should have just created an issue rather than spend my time writing a hacky solution. I wasn't so happy myself with the regex search that scans the entire document for time:

@jrs65 I like your suggestion of switching to a parser that defaults to YAML 1.2 like ruamel, but it could be a bit of an undertaking. Would we want to make that change just in caput or across the entire CHIME codebase? I think PyYAML is also used in ch_pipeline, chimedb, chimedb-dataflag, draco, and driftscan

Depending on the version of ruamel, its API differs a bit from that of PyYAML. Looking at the ruamel documentation's basic usage examples, implementing the latest version would look something like this:

from ruamel.yaml import YAML

yaml = YAML(typ="safe")
yaml.default_flow_style = False

That would hopefully let the .load and .dump methods that we use work without further changes (I didn't try yet). Here and there we also use PyYAML's .safe_load method and the SafeLoader class, which will require further changes.

radiocosmology / caput

feat(pipeline): check cluster time in yaml config #254