Open rikvl opened 1 year ago
@rikvl I definitely see the problem here, but I'm not sure if this is the way we should go about solving it. As far as I can tell, this conversion was deemed to be a mistake in the YAML spec and removed from YAML 1.2. Unfortunately PyYAML only supports YAML 1.1 and so will do this conversion.
I'm a little concerned that your check is just going to be picking up on other things, as for instance it will scan the whole document for a time:
tag. There are maybe a few better ways to address this.
Just switching to ruamel is probably the easiest. It will also fix the so-called Norway problem too, although that might break some of our older configs.
Anyway, I'm definitely interested to hear your thoughts.
Right, I should have just created an issue rather than spend my time writing a hacky solution. I wasn't so happy myself with the regex search that scans the entire document for time:
@jrs65 I like your suggestion of switching to a parser that defaults to YAML 1.2 like ruamel, but it could be a bit of an undertaking. Would we want to make that change just in caput or across the entire CHIME codebase? I think PyYAML is also used in ch_pipeline, chimedb, chimedb-dataflag, draco, and driftscan
Depending on the version of ruamel, its API differs a bit from that of PyYAML. Looking at the ruamel documentation's basic usage examples, implementing the latest version would look something like this:
from ruamel.yaml import YAML
yaml = YAML(typ="safe")
yaml.default_flow_style = False
That would hopefully let the .load
and .dump
methods that we use work without further changes (I didn't try yet). Here and there we also use PyYAML's .safe_load
method and the SafeLoader
class, which will require further changes.
I ran into a small issue with the cluster configuration that may go under the radar:
I tried to specify a job time of 12 hours in a configuration file in the following way (without quotation marks):
PyYAML accepts this without complaints, but parses the value as a sexagimal (base 60) number, giving the following line in the job script:
Slurm then interprets this bare number as minutes, so it results in a job with a time request of 43200 minutes = 720 hours = 30 days, i.e. 60 times more time than I was expecting. In this particular case Cedar did not accept the job, because the time requested is longer than the maximum allowed (28 days). However, if the raw value is 11:12:00 or less Cedar won't complain, and the erroneous time value will lead to jobs inadvertently requesting much more time than needed and maybe having difficulties getting scheduled.
The solution is to make sure such time values are enclosed in quotation marks.
My goal in this PR was to let the pipeline give an error (it could perhaps be a warning instead) when it encounters time values consisting of just digits and colons. Because the check has to be made on the raw configuration string
yaml_doc
(not on theyaml_params
dictionary that is already parsed by PyYAML) I used some ugly regex matching to do the check. Maybe there's a better way.Either way I do think it's would be good to add some sort of check for such time values. An alternative would be to try to simply educate everyone to put quotes around the time value. A quick check of the example configuration files in ch_pipeline shows there's one with this mistake: https://github.com/chime-experiment/ch_pipeline/blob/master/examples/noise_calibration.yaml