An extended constructor for PyYAML: include other YAML files in the current YAML document.
In version 2.0, fsspec was introduced. With it, we can even include files over HTTP, SFTP, S3 ...
⚠️ Warning \ “pyyaml-include” 2.0 is NOT compatible with 1.0
pip install "pyyaml-include"
Since v2.0, fsspec is used to open the included files. To open remote files, install the matching fsspec extra as shown below:
for files on a website:
pip install "pyyaml-include" fsspec[http]
for files on S3:
pip install "pyyaml-include" fsspec[s3]
see fsspec's documentation for more
🔖 Tip \ “pyyaml-include” depends on fsspec, so fsspec is installed whether you include local or remote files.
Suppose we have the following YAML files:
├── 0.yml
└── include.d
    ├── 1.yml
    └── 2.yml
1.yml's content:
name: "1"
2.yml's content:
name: "2"
To include 1.yml and 2.yml in 0.yml, we shall:
Register a yaml_include.Constructor to PyYAML's loader class, with !inc (or any other tag starting with the ! character) as its tag:
import yaml
import yaml_include
# add the tag
yaml.add_constructor("!inc", yaml_include.Constructor(base_dir='/your/conf/dir'))
Use !inc tag(s) in 0.yml:
file1: !inc include.d/1.yml
file2: !inc include.d/2.yml
Load 0.yml in your Python program:
with open('0.yml') as f:
    data = yaml.full_load(f)
print(data)
we'll get:
{'file1': {'name': '1'}, 'file2': {'name': '2'}}
(optional) the constructor can be unregistered:
del yaml.Loader.yaml_constructors["!inc"]
del yaml.UnsafeLoader.yaml_constructors["!inc"]
del yaml.FullLoader.yaml_constructors["!inc"]
If 0.yml was:
file1: !inc include.d/1.yml
file2: !inc include.d/2.yml
We'll get:
file1:
  name: "1"
file2:
  name: "2"
If 0.yml was:
files:
  - !inc include.d/1.yml
  - !inc include.d/2.yml
We'll get:
files:
  - name: "1"
  - name: "2"
File names can contain shell-style wildcards. Data loaded from the file(s) matched by the wildcards is returned as a sequence.
That is, a list is returned when the included file name contains wildcards. The length of the returned list equals the number of matched files:
If 0.yml was:
files: !inc include.d/*.yml
We'll get:
files:
  - name: "1"
  - name: "2"
We support **, ? and [..]. We do not support ^ for pattern negation.
The maxdepth option is applied on the first ** found in the path.
❗ Important
- Using the ** pattern in large directory trees or remote file systems (S3, HTTP ...) may consume an inordinate amount of time.
- There is no lazy loading or iteration: all data of the matched files is fully loaded into memory and returned to the YAML doc-tree, so a large amount of memory may be needed if there are many or big files.
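To get a feel for which files each wildcard form selects, the sketch below builds the example directory tree in a temporary folder and matches it with the standard library's glob module. This is only a stand-in: it assumes that stdlib glob and the fsspec glob used by the constructor agree on these simple patterns, and the helper name match is made up for the illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Recreate the example layout: 0.yml, include.d/1.yml, include.d/2.yml
    os.makedirs(os.path.join(root, "include.d"))
    for name in ("0.yml", "include.d/1.yml", "include.d/2.yml"):
        with open(os.path.join(root, *name.split("/")), "w") as f:
            f.write('name: "x"\n')

    def match(pattern):
        """Return the sorted, root-relative paths matched by a pattern (hypothetical helper)."""
        found = glob.glob(os.path.join(glob.escape(root), pattern), recursive=True)
        return sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in found)

    star = match("include.d/*.yml")      # any file name
    question = match("include.d/?.yml")  # exactly one character
    bracket = match("include.d/[1].yml") # character set
    recursive = match("**/*.yml")        # any directory depth

print(star, question, bracket, recursive, sep="\n")
```

Each matched path would become one element of the list returned by the include statement, so the list length follows the number of matches.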
In v2.0, we use fsspec to open included files, so we can include files from many different sources, such as the local file system, S3, HTTP, SFTP ...
For example, we can include a file from a website in YAML:
conf:
  logging: !inc http://domain/etc/app/conf.d/logging.yml
In such situations, when creating a Constructor object, an fsspec filesystem object shall be passed as the fs argument.
For example, to include files from a website, we shall:
create a Constructor with an fsspec HTTP filesystem object as its fs:
import yaml
import fsspec
import yaml_include
http_fs = fsspec.filesystem("http", client_kwargs={"base_url": f"http://{HOST}:{PORT}"})
ctor = yaml_include.Constructor(fs=http_fs, base_dir="/foo/baz")
yaml.add_constructor("!inc", ctor, yaml.Loader)
then, write a YAML document to include files from http://${HOST}:${PORT}:
key1: !inc doc1.yml # relative path to "base_dir"
key2: !inc ./doc2.yml # relative path to "base_dir" also
key3: !inc /doc3.yml # absolute path, "base_dir" does not affect it
key4: !inc ../doc4.yml # relative path, one level above "base_dir"
load it with PyYAML:
yaml.load(yaml_string, yaml.Loader)
The YAML snippet above will be loaded as:
key1: parsed YAML of http://${HOST}:${PORT}/foo/baz/doc1.yml
key2: parsed YAML of http://${HOST}:${PORT}/foo/baz/doc2.yml
key3: parsed YAML of http://${HOST}:${PORT}/doc3.yml
key4: parsed YAML of http://${HOST}:${PORT}/foo/doc4.yml
🔖 Tip \ Check fsspec's documentation for more
ℹ️ Note \ If the fs argument is omitted, a "file"/"local" fsspec filesystem object will be used automatically. That is to say:
data: !inc foo/baz.yaml
is equivalent to (if no base_dir was set in Constructor()):
data: !inc file://foo/baz.yaml
and
yaml.add_constructor("!inc", Constructor())
is equivalent to:
yaml.add_constructor("!inc", Constructor(fs=fsspec.filesystem("file")))
As a callable object, Constructor passes YAML tag parameters to fsspec for more detailed operations.
The first argument is urlpath. It is fixed and required, and can be either positional or named.
Normally, we put it as a string after the tag (e.g. !inc), just like the examples above.
However, there are more parameters.
In sequence form, parameters are passed to Python as positional arguments, like *args in a Python function. e.g.:
files: !inc [include.d/**/*.yaml, {maxdepth: 1}, {encoding: utf16}]
In mapping form, parameters are passed to Python as named arguments, like **kwargs in a Python function. e.g.:
files: !inc {urlpath: /foo/baz.yaml, encoding: utf16}
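The mapping from the three YAML forms (scalar, sequence, mapping) to Python call arguments can be pictured with the small sketch below. It is an illustration of the idea only, not the library's code: split_params is a hypothetical helper, and it works on the plain Python values that a scalar, sequence or mapping node would produce.

```python
def split_params(value):
    """Split the value written after an include tag into (urlpath, args, kwargs).

    Hypothetical helper for illustration; not part of yaml_include.
    """
    if isinstance(value, str):
        # Scalar form: the string is the urlpath itself.
        return value, [], {}
    if isinstance(value, list):
        # Sequence form: first item is urlpath, the rest behave like *args.
        return value[0], list(value[1:]), {}
    if isinstance(value, dict):
        # Mapping form: "urlpath" is required, the rest behave like **kwargs.
        kwargs = dict(value)
        return kwargs.pop("urlpath"), [], kwargs
    raise TypeError(f"unsupported include parameter type: {type(value).__name__}")


scalar = split_params("include.d/1.yml")
sequence = split_params(["file:///foo/baz.yaml", "r"])
mapping = split_params({"urlpath": "/foo/baz.yaml", "encoding": "utf16"})
print(scalar, sequence, mapping, sep="\n")
```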
But the format of the parameters has multiple cases, and varies between fsspec implementation backends.
If a scheme/protocol (“http://”, “sftp://”, “file://”, etc.) is defined and there is no wildcard in urlpath, Constructor will invoke fsspec.open directly to open it. This means Constructor's fs will be ignored, and a new standalone fs will be created implicitly.
In this situation, urlpath will be passed as fsspec.open's first argument, and all other parameters will also be passed to the function.
For example,
files: !inc [file:///foo/baz.yaml, r]
will cause python code like
with fsspec.open("file:///foo/baz.yaml", "r") as f:
    yaml.load(f, Loader)
files: !inc {urlpath: file:///foo/baz.yaml, encoding: utf16}
will cause python code like
with fsspec.open("file:///foo/baz.yaml", encoding="utf16") as f:
    yaml.load(f, Loader)
If urlpath has a wildcard and also a scheme in it, Constructor will:
Invoke fsspec's open_files function to search, open and load files, and return the results in a list. The YAML include statement's parameters are passed to the open_files function.
If urlpath has a wildcard and no scheme in it, Constructor will:
- call the glob method of the corresponding fsspec implementation backend to search files,
- then call its open method to open each found file.
urlpath will be passed as the first argument to both the glob and open methods of the corresponding fsspec implementation backend, and other parameters will also be passed to glob and open as their following arguments.
In the case of wildcards, pay special attention to the fact that there are two separate parameters after urlpath: the first is for the glob method, and the second is for the open method. Each of them can be a scalar, a sequence or a mapping, corresponding to single, positional and named argument(s) in Python. For example:
If we want to include every .yml file in directory etc/app recursively with a max depth of 2, and open them with the utf-16 codec, we shall write the YAML as below:
files: !inc ["etc/app/**/*.yml", {maxdepth: !!int "2"}, {encoding: utf16}]
it will cause python code like:
for file in local_fs.glob("etc/app/**/*.yml", maxdepth=2):
    with local_fs.open(file, encoding="utf16") as f:
        yaml.load(f, Loader)
Since maxdepth is the second argument after path in the glob method, we can also write the YAML like this:
files: !inc ["etc/app/**/*.yml", [!!int "2"]]
The parameters for open are omitted, which means no arguments other than urlpath are passed.
it will cause python code like:
for file in local_fs.glob("etc/app/**/*.yml", 2):
    with local_fs.open(file) as f:
        yaml.load(f, Loader)
The two parameters can also be given in mapping form, under the keys "glob" and "open". For example:
files: !inc {urlpath: "etc/app/**/*.yml", glob: [!!int "2"], open: {encoding: utf16}}
❗ Important \ PyYAML sometimes takes a scalar parameter of a custom constructor as a string; we can use a ‘Standard YAML tag’ to ensure a non-string data type in this situation.
For example, the following YAML snippet may cause an error:
files: !inc ["etc/app/**/*.yml", open: {intParam: 1}]
Because PyYAML treats {"intParam": 1} as {"intParam": "1"}, which results in python code like fs.open(path, intParam="1"). To prevent this, we shall write the YAML like:
files: !inc ["etc/app/**/*.yml", open: {intParam: !!int 1}]
where !!int is a ‘Standard YAML tag’ that forces the integer type of the intParam argument.
ℹ️ Note \ BaseLoader, SafeLoader, CBaseLoader, CSafeLoader do NOT support ‘Standard YAML tag’.
🔖 Tip \ The maxdepth argument of the fsspec glob method is already force-converted by Constructor, so there is no need to write a !!int tag on it.
Else, Constructor will invoke the corresponding fsspec implementation backend's open method to open the file; parameters besides urlpath will be passed to the method.
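The four cases above can be summarised as a small decision function. This is a sketch of the rules as described in this document, not the library's implementation: choose_strategy is a hypothetical name, and the scheme and wildcard checks (looking for "://" and for any of the characters *, ?, [) are simplifying assumptions.

```python
def choose_strategy(urlpath):
    """Return which fsspec call path the rules above would select (illustrative only)."""
    has_scheme = "://" in urlpath                         # assumed scheme test
    has_wildcard = any(ch in urlpath for ch in "*?[")     # assumed wildcard test
    if has_scheme and not has_wildcard:
        return "fsspec.open"          # the constructor's own fs is ignored
    if has_scheme and has_wildcard:
        return "fsspec.open_files"    # search, open and load in one call
    if has_wildcard:
        return "fs.glob + fs.open"    # glob first, then open each match
    return "fs.open"                  # plain path on the configured fs


strategies = {
    path: choose_strategy(path)
    for path in (
        "file:///foo/baz.yaml",
        "s3://bucket/conf/*.yml",
        "etc/app/**/*.yml",
        "include.d/1.yml",
    )
}
for path, strategy in strategies.items():
    print(f"{path} -> {strategy}")
```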
When the path after the include tag (e.g. !inc) is not a full protocol/scheme URL and does not start with "/", Constructor tries to join the path with base_dir, which is an argument of Constructor.__init__().
If base_dir is omitted or None, the actual included file path is the path defined in YAML without change, and different fsspec filesystems will treat it differently. In the local filesystem, it is relative to cwd.
For a remote filesystem, HTTP for example, base_dir can not be None and is usually set to "/".
A full protocol/scheme URL is never treated as a relative path; base_dir has no effect on it.
For example, if we register such a Constructor to PyYAML:
import yaml
import fsspec
import yaml_include
yaml.add_constructor(
    "!http-include",
    yaml_include.Constructor(
        fsspec.filesystem("http", client_kwargs={"base_url": f"http://{HOST}:{PORT}"}),
        base_dir="/sub_1/sub_1_1"
    )
)
then, load the following YAML:
xyz: !http-include xyz.yml
the actual URL to access is http://$HOST:$PORT/sub_1/sub_1_1/xyz.yml
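The base_dir rules can be checked with a few lines of path arithmetic. The sketch below is an approximation under stated assumptions: effective_path is a hypothetical helper, a scheme is detected by looking for "://", and posixpath.normpath is used so that ".." segments collapse the way the key4 example earlier suggests. The library itself may join paths differently.

```python
import posixpath


def effective_path(urlpath, base_dir=None):
    """Approximate the path that is opened for a given include path (illustrative only)."""
    if "://" in urlpath or urlpath.startswith("/") or base_dir is None:
        # Full URLs and absolute paths bypass base_dir; so does a missing base_dir.
        return urlpath
    return posixpath.normpath(posixpath.join(base_dir, urlpath))


joined = effective_path("xyz.yml", "/sub_1/sub_1_1")
parent = effective_path("../doc4.yml", "/foo/baz")
absolute = effective_path("/doc3.yml", "/foo/baz")
with_scheme = effective_path("file:///foo/baz.yaml", "/foo/baz")
no_base = effective_path("include.d/1.yml")
print(joined, parent, absolute, with_scheme, no_base, sep="\n")
```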
Suppose we have such a YAML:
items: !include "*.yaml"
If every file matching *.yaml contains a sequence object at its top level, what is parsed and loaded will be:
items: [
  [item 0 of 1st file, item 1 of 1st file, ... , item n of 1st file, ...],
  [item 0 of 2nd file, item 1 of 2nd file, ... , item n of 2nd file, ...],
  # ....
  [item 0 of nth file, item 1 of nth file, ... , item n of nth file, ...],
  # ...
]
It's a 2-dim array, because the YAML content of each matched file is treated as a member of the list (sequence).
But if the flatten parameter is set to true, like:
items: !include {urlpath: "*.yaml", flatten: true}
we'll get:
items: [
  item 0 of 1st file, item 1 of 1st file, ... , item n of 1st file, # ...
  item 0 of 2nd file, item 1 of 2nd file, ... , item n of 2nd file, # ...
  # ....
  item 0 of n-th file, item 1 of n-th file, ... , item n of n-th file, # ...
  # ...
]
ℹ️ Note
- Only available when multiple files were matched.
- Every matched file should have a Sequence object at its top level, or a TypeError exception may be thrown.
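The difference between the nested and the flattened result can be reproduced with plain Python lists. The sketch below only mirrors the semantics described above with itertools; the sample data is made up, and it is not meant to show how the library implements flatten.

```python
from itertools import chain

# Pretend these are the top-level sequences parsed from three matched files.
per_file = [
    ["a0", "a1"],
    ["b0"],
    ["c0", "c1", "c2"],
]

# Default behaviour: one list element per matched file (2-dim result).
nested = list(per_file)

# With flatten: the items of every file are concatenated into one list.
flat = list(chain.from_iterable(per_file))

print(nested)
print(flat)
```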
When loading a YAML string with include statements, the included files are parsed into python objects by default. That is, if we call yaml.dump() on the object, what is dumped is the parsed python object; the include statement itself can not be serialized.
To serialize the statement, we shall first create a yaml_include.Constructor object whose autoload attribute is False:
import yaml
import yaml_include
ctor = yaml_include.Constructor(autoload=False)
then add both the Constructor for the Loader and the Representer for the Dumper:
yaml.add_constructor("!inc", ctor)
rpr = yaml_include.Representer("inc")
yaml.add_representer(yaml_include.Data, rpr)
Now, the included files will not be loaded when calling yaml.load(), and yaml_include.Data objects will be placed at the positions where the include statements are.
continuing the code above:
yaml_str = """
- !inc include.d/1.yaml
- !inc include.d/2.yaml
"""
d0 = yaml.load(yaml_str, yaml.Loader)
# Here, "include.d/1.yaml" and "include.d/2.yaml" are not opened or loaded.
# d0 is like:
# [Data(urlpath="include.d/1.yaml"), Data(urlpath="include.d/2.yaml")]
# serialize d0
s = yaml.dump(d0)
print(s)
# 's' will be:
# - !inc 'include.d/1.yaml'
# - !inc 'include.d/2.yaml'
# de-serialization
ctor.autoload = True # re-open auto load
# then load, the file "include.d/1.yaml" and "include.d/2.yaml" will be opened and loaded.
d1 = yaml.load(s, yaml.Loader)
# Or perform a recursive opening / parsing on the object:
d2 = yaml_include.load(d0) # d2 is equal to d1
autoload can be switched temporarily in a with statement:
ctor = yaml_include.Constructor()
# autoload is True here
with ctor.managed_autoload(False):
    # temporarily set autoload to False
    yaml.full_load(YAML_TEXT)
# autoload is restored to True automatically
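For readers unfamiliar with this pattern, a temporary toggle like the one above can be built with contextlib. The class below is a hypothetical stand-in written for this illustration, not yaml_include.Constructor, and the real method may be implemented differently; it only shows why the previous value comes back even if the body raises.

```python
from contextlib import contextmanager


class AutoloadToggle:
    """Hypothetical stand-in holding an autoload flag (not the library class)."""

    def __init__(self, autoload=True):
        self.autoload = autoload

    @contextmanager
    def managed_autoload(self, value):
        previous = self.autoload
        self.autoload = value
        try:
            yield self
        finally:
            # Restore the previous value, even when the with-body raises.
            self.autoload = previous


toggle = AutoloadToggle()
states = [toggle.autoload]            # before the with-block
with toggle.managed_autoload(False):
    states.append(toggle.autoload)    # inside the with-block
states.append(toggle.autoload)        # after the with-block
print(states)
```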
We can include files in formats other than YAML, like JSON or TOML -- custom_loader is for that.
📑 Example \ For example:
import json
import tomllib as toml
import yaml
import yaml_include

# Define loader function
def my_loader(urlpath, file, Loader):
    if urlpath.endswith(".json"):
        return json.load(file)
    if urlpath.endswith(".toml"):
        return toml.load(file)
    return yaml.load(file, Loader)

# Create the include constructor, with the custom loader
ctor = yaml_include.Constructor(custom_loader=my_loader)

# Add the constructor to YAML Loader
yaml.add_constructor("!inc", ctor, yaml.Loader)

# Then, json files will be loaded by std-lib's json module, and the same goes for toml files.
s = """
json: !inc "*.json"
toml: !inc "*.toml"
yaml: !inc "*.yaml"
"""

yaml.load(s, yaml.Loader)
clone the repo:
git clone https://github.com/tanbro/pyyaml-include.git
cd pyyaml-include
create then activate a python virtual-env:
python -m venv .venv
source .venv/bin/activate
install development requirements and the project itself in editable mode:
pip install -r requirements.txt
Now you can work on it.
read: tests/README.md