tanbro / pyyaml-include

yaml include other yaml
https://pypi.org/project/pyyaml-include/
GNU General Public License v3.0

pyyaml-include


An extending constructor for PyYAML: it includes other YAML files into the current YAML document.

Version 2.0 introduced fsspec. With it, we can even include files over HTTP, SFTP, S3 ...

⚠️ Warning \ “pyyaml-include” 2.0 is NOT compatible with 1.0

Install

pip install "pyyaml-include"

Since v2.0, fsspec is used to open the included files. To open remote files, also install the fsspec extra for the protocol you need, for example pip install "pyyaml-include" "fsspec[http]" for files served over HTTP.

🔖 Tip \ “pyyaml-include” depends on fsspec, so fsspec is installed whether you include local or remote files.

Basic usages

Suppose we have these YAML files:

├── 0.yml
└── include.d
    ├── 1.yml
    └── 2.yml

To include 1.yml and 2.yml in 0.yml, we shall:

  1. Register a yaml_include.Constructor to PyYAML's loader class, with !inc (or any other tag that starts with the ! character) as its tag:

    import yaml
    import yaml_include
    
    # add the tag
    yaml.add_constructor("!inc", yaml_include.Constructor(base_dir='/your/conf/dir'))
  2. Use !inc tag(s) in 0.yml:

    file1: !inc include.d/1.yml
    file2: !inc include.d/2.yml
  3. Load 0.yml in your Python program:

    with open('0.yml') as f:
        data = yaml.full_load(f)
    print(data)

    We'll get:

    {'file1': {'name': '1'}, 'file2': {'name': '2'}}
  4. (optional) the constructor can be unregistered:

    del yaml.Loader.yaml_constructors["!inc"]
    del yaml.UnsafeLoader.yaml_constructors["!inc"]
    del yaml.FullLoader.yaml_constructors["!inc"]
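To make the mechanics of the steps above more concrete, here is a minimal sketch of what an include constructor does, written against plain PyYAML only. ToyInclude and the !toy-inc tag are hypothetical names invented for this illustration. This is not how yaml_include.Constructor is implemented (the real one goes through fsspec and supports much more); it only assumes local files.

```python
import os
import tempfile

import yaml


class ToyInclude:
    """Hypothetical toy stand-in for an include constructor (local files only)."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def __call__(self, loader, node):
        # The text after the tag arrives as a scalar node.
        relpath = loader.construct_scalar(node)
        path = os.path.join(self.base_dir, relpath)
        with open(path, encoding="utf-8") as f:
            # Parse the included file with the same loader class.
            return yaml.load(f, type(loader))


def demo():
    """Build the 0.yml / include.d layout in a temp dir and load it."""
    with tempfile.TemporaryDirectory() as conf_dir:
        os.mkdir(os.path.join(conf_dir, "include.d"))
        for n in ("1", "2"):
            path = os.path.join(conf_dir, "include.d", n + ".yml")
            with open(path, "w", encoding="utf-8") as f:
                f.write('name: "%s"\n' % n)
        # Step 1: register the callable under a tag.
        yaml.add_constructor("!toy-inc", ToyInclude(conf_dir), yaml.FullLoader)
        try:
            # Steps 2 and 3: use the tag and load the document.
            text = "file1: !toy-inc include.d/1.yml\nfile2: !toy-inc include.d/2.yml\n"
            return yaml.load(text, yaml.FullLoader)
        finally:
            # Step 4: unregister again.
            del yaml.FullLoader.yaml_constructors["!toy-inc"]


print(demo())
```

The point of the sketch is only that a tag constructor is a callable receiving the loader and the tagged node, and whatever it returns is placed at the position of the tag.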

Include in Mapping

If 0.yml was:

file1: !inc include.d/1.yml
file2: !inc include.d/2.yml

We'll get:

file1:
  name: "1"
file2:
  name: "2"

Include in Sequence

If 0.yml was:

files:
  - !inc include.d/1.yml
  - !inc include.d/2.yml

We'll get:

files:
  - name: "1"
  - name: "2"

Advanced usages

Wildcards

File names can contain shell-style wildcards. Data loaded from the files matched by the wildcards is collected in a sequence.

That is, a list is returned when the included file name contains wildcards. The length of the returned list equals the number of matched files:

If 0.yml was:

files: !inc include.d/*.yml

We'll get:

files:
  - name: "1"
  - name: "2"

We support **, ? and [..]. We do not support ^ for pattern negation. The maxdepth option is applied on the first ** found in the path.

Important

  • Using the ** pattern in large directory trees or on remote file systems (S3, HTTP ...) may take an inordinate amount of time.
  • There is no lazy loading or iteration: the data of all matched files is fully loaded into memory when it is returned to the YAML doc-tree, so many or large files may require a large amount of memory.
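The wildcard forms described above can be tried out with the standard library alone. The sketch below uses Python's glob module as a stand-in; the library itself matches files through fsspec, so details such as ordering or maxdepth handling may differ. build_demo_tree and match are hypothetical helper names.

```python
import glob
import os
import tempfile


def build_demo_tree(root):
    """Create include.d/1.yml, include.d/2.yml and include.d/sub/3.yml under root."""
    for rel in ("include.d/1.yml", "include.d/2.yml", "include.d/sub/3.yml"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write("name: demo\n")


def match(root, pattern):
    """Return the sorted, root-relative paths matched by a shell-style pattern."""
    hits = glob.glob(os.path.join(root, *pattern.split("/")), recursive=True)
    return sorted(os.path.relpath(h, root).replace(os.sep, "/") for h in hits)


with tempfile.TemporaryDirectory() as root:
    build_demo_tree(root)
    flat = match(root, "include.d/*.yml")      # direct children only
    deep = match(root, "include.d/**/*.yml")   # any depth below include.d
    single = match(root, "include.d/[1].yml")  # character class
    print(len(flat), len(deep), len(single))
```

Each pattern yields one list entry per matched file, which mirrors the rule that the included value is a list whose length equals the number of matches.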

Work with fsspec

In v2.0, we use fsspec to open the included files, so we can include files from many different sources, such as the local file system, S3, HTTP, SFTP ...

For example, we can include a file from a website in YAML:

conf:
  logging: !inc http://domain/etc/app/conf.d/logging.yml

In such situations, an fsspec filesystem object shall be passed as the fs argument when creating a Constructor.

For example, to include files from a website, we shall:

  1. Create a Constructor with an fsspec HTTP filesystem object as its fs:

    import yaml
    import fsspec
    import yaml_include
    
    http_fs = fsspec.filesystem("http", client_kwargs={"base_url": f"http://{HOST}:{PORT}"})
    
    ctor = yaml_include.Constructor(fs=http_fs, base_dir="/foo/baz")
    yaml.add_constructor("!inc", ctor, yaml.Loader)
  2. then, write a YAML document to include files from http://${HOST}:${PORT}:

    key1: !inc doc1.yml    # relative path to "base_dir"
    key2: !inc ./doc2.yml  # relative path to "base_dir" also
    key3: !inc /doc3.yml   # absolute path, "base_dir" has no effect
    key4: !inc ../doc4.yml # relative path, one level above "base_dir"
  3. load it with PyYAML:

    yaml.load(yaml_string, yaml.Loader)

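The YAML snippet above will be loaded like this:

  • doc1.yml: parsed YAML of http://${HOST}:${PORT}/foo/baz/doc1.yml
  • ./doc2.yml: parsed YAML of http://${HOST}:${PORT}/foo/baz/doc2.yml
  • /doc3.yml: parsed YAML of http://${HOST}:${PORT}/doc3.yml
  • ../doc4.yml: parsed YAML of http://${HOST}:${PORT}/foo/doc4.yml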

🔖 Tip \ Check fsspec's documentation for more


ℹ️ Note \ If the fs argument is omitted, a "file"/"local" fsspec filesystem object is used automatically. That is to say:

data: !inc foo/baz.yaml

is equivalent to (if no base_dir was set in Constructor()):

data: !inc file://foo/baz.yaml

and

yaml.add_constructor("!inc", yaml_include.Constructor())

is equivalent to:

yaml.add_constructor("!inc", yaml_include.Constructor(fs=fsspec.filesystem("file")))

Parameters in YAML

As a callable object, Constructor passes YAML tag parameters to fsspec for more detailed operations.

The first argument is urlpath. It is fixed and required, and can be given either positionally or by name. Normally we put it as a string after the tag (e.g. !inc), as in the examples above.

More parameters can follow it, but their format has several cases and varies between fsspec implementation backends.

Absolute and Relative URL/Path

When the path after the include tag (e.g. !inc) is not a full protocol/scheme URL and does not start with "/", Constructor tries to join the path with base_dir, which is an argument of Constructor.__init__(). If base_dir is omitted or None, the included file path is the path defined in YAML, unchanged, and different fsspec filesystems will treat it differently. On the local filesystem, it is resolved against the cwd.

For a remote filesystem, HTTP for example, base_dir can not be None and is usually set to "/".

A full protocol/scheme URL is never treated as a relative path; base_dir has no effect on it.

For example, if we register such a Constructor to PyYAML:

import yaml
import fsspec
import yaml_include

yaml.add_constructor(
    "!http-include",
    yaml_include.Constructor(
        fsspec.filesystem("http", client_kwargs={"base_url": f"http://{HOST}:{PORT}"}),
        base_dir="/sub_1/sub_1_1"
    )
)

then load the following YAML:

xyz: !http-include xyz.yml

the actual URL to access is http://$HOST:$PORT/sub_1/sub_1_1/xyz.yml
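The joining rule described in this section can be sketched as a small pure function. resolve_include_path is a hypothetical name, and normalizing "." and ".." segments with posixpath.normpath is an assumption of this sketch (with a real HTTP filesystem the resolution may happen elsewhere); it is not the library's own code.

```python
import posixpath


def resolve_include_path(urlpath, base_dir=None):
    """Sketch of the described rule; normalization is an assumption."""
    if "://" in urlpath:         # full protocol/scheme URL: left untouched
        return urlpath
    if urlpath.startswith("/"):  # absolute path: base_dir has no effect
        return urlpath
    if base_dir is None:         # nothing to join with: path used as written
        return urlpath
    return posixpath.normpath(posixpath.join(base_dir, urlpath))


print(resolve_include_path("xyz.yml", "/sub_1/sub_1_1"))
```

With base_dir="/sub_1/sub_1_1" the relative path xyz.yml maps to /sub_1/sub_1_1/xyz.yml, which the HTTP filesystem then requests under its base URL.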

Flatten sequence object in multiple matched files

Suppose we have this YAML:

items: !include "*.yaml"

If every file matching *.yaml contains a sequence at its top level, the parsed and loaded result will be:

items: [
    [item 0 of 1st file, item 1 of 1st file, ... , item n of 1st file, ...],
    [item 0 of 2nd file, item 1 of 2nd file, ... , item n of 2nd file, ...],
    # ....
    [item 0 of nth file, item 1 of nth file, ... , item n of nth file, ...],
    # ...
]

It's a 2-dim array, because the YAML content of each matched file is treated as one member of the list (sequence).

But if the flatten parameter is set to true, like:

items: !include {urlpath: "*.yaml", flatten: true}

we'll get:

items: [
    item 0 of 1st file, item 1 of 1st file, ... , item n of 1st file,  # ...
    item 0 of 2nd file, item 1 of 2nd file, ... , item n of 2nd file,  # ...
    # ....
    item 0 of n-th file, item 1 of n-th file, ... , item n of n-th file,  # ...
    # ...
]

ℹ️ Note

  • Only available when multiple files are matched.
  • Every matched file should have a Sequence object at its top level, or a TypeError exception may be thrown.
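The difference between the nested and the flattened result can be illustrated with plain Python lists. combine_matches is a hypothetical helper that only mimics the behavior described above; the explicit TypeError check is an assumption based on the note, not the library's own code.

```python
from itertools import chain


def combine_matches(per_file_data, flatten=False):
    """Combine the parsed content of several matched files into one list."""
    if not flatten:
        # One list member per matched file (the 2-dim case).
        return list(per_file_data)
    for data in per_file_data:
        if not isinstance(data, list):
            raise TypeError("flatten expects a sequence at the top level of every file")
    # Concatenate the per-file sequences into a single flat list.
    return list(chain.from_iterable(per_file_data))


files = [["a0", "a1"], ["b0"], ["c0", "c1", "c2"]]
nested = combine_matches(files)
flat = combine_matches(files, flatten=True)
print(nested)
print(flat)
```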

Serialization

When a YAML string with include statements is loaded, the included files are parsed into Python objects by default. That is, if we call yaml.dump() on the result, what gets dumped is the parsed Python object; the include statement itself can not be serialized.

To serialize the statement, we shall first create a yaml_include.Constructor object whose autoload attribute is False:

import yaml
import yaml_include

ctor = yaml_include.Constructor(autoload=False)

then add both the Constructor to the Loader and the Representer to the Dumper:

yaml.add_constructor("!inc", ctor)

rpr = yaml_include.Representer("inc")
yaml.add_representer(yaml_include.Data, rpr)

Now the included files are not loaded when yaml.load() is called; yaml_include.Data objects are placed at the positions of the include statements instead.

Continuing the code above:

yaml_str = """
- !inc include.d/1.yaml
- !inc include.d/2.yaml
"""

d0 = yaml.load(yaml_str, yaml.Loader)
# Here, "include.d/1.yaml" and "include.d/2.yaml" are not opened or loaded.
# d0 is like:
# [Data(urlpath="include.d/1.yaml"), Data(urlpath="include.d/2.yaml")]

# serialize d0
s = yaml.dump(d0)
print(s)
# 's' will be:
# - !inc 'include.d/1.yaml'
# - !inc 'include.d/2.yaml'

# de-serialization
ctor.autoload = True  # turn auto load back on
# then load: the files "include.d/1.yaml" and "include.d/2.yaml" will be opened and loaded.
d1 = yaml.load(s, yaml.Loader)

# Or perform a recursive opening / parsing on the object:
d2 = yaml_include.load(d0) # d2 is equal to d1
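The round trip above relies on two PyYAML hooks: a constructor that returns a placeholder instead of loading the file, and a representer that writes the placeholder back as a tagged scalar. The following toy shows that mechanism with PyYAML alone. ToyRef, construct_ref, represent_ref and the !toy-inc tag are hypothetical names; yaml_include.Data and yaml_include.Representer are the library's real counterparts and are not reproduced here.

```python
import yaml


class ToyRef:
    """Hypothetical placeholder that records an include instead of loading it."""

    def __init__(self, urlpath):
        self.urlpath = urlpath

    def __eq__(self, other):
        return isinstance(other, ToyRef) and other.urlpath == self.urlpath


def construct_ref(loader, node):
    # Keep the path only; no file is opened here.
    return ToyRef(loader.construct_scalar(node))


def represent_ref(dumper, data):
    # Write the placeholder back as a tagged scalar.
    return dumper.represent_scalar("!toy-inc", data.urlpath)


yaml.add_constructor("!toy-inc", construct_ref, yaml.FullLoader)
yaml.add_representer(ToyRef, represent_ref)

source = "- !toy-inc include.d/1.yaml\n- !toy-inc include.d/2.yaml\n"
refs = yaml.load(source, yaml.FullLoader)   # list of ToyRef placeholders
text = yaml.dump(refs)                      # include statements are preserved
again = yaml.load(text, yaml.FullLoader)    # same placeholders after a round trip
print(text)
```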

autoload can also be switched temporarily in a with statement:

ctor = yaml_include.Constructor()
# autoload is True here

with ctor.managed_autoload(False):
    # temporarily set autoload to False
    yaml.full_load(YAML_TEXT)
# autoload is automatically restored to True
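What "temporarily set, then restore" means can be sketched with contextlib. ToyConstructor is a hypothetical class that only models the autoload attribute; whether the real managed_autoload is written this way is an assumption.

```python
from contextlib import contextmanager


class ToyConstructor:
    """Hypothetical model of an object with a switchable autoload flag."""

    def __init__(self, autoload=True):
        self.autoload = autoload

    @contextmanager
    def managed_autoload(self, autoload):
        saved = self.autoload
        self.autoload = autoload
        try:
            yield self
        finally:
            # Restore the previous value even if the body raised.
            self.autoload = saved


ctor = ToyConstructor()
with ctor.managed_autoload(False):
    inside = ctor.autoload   # value seen inside the with block
after = ctor.autoload        # value once the block has exited
print(inside, after)
```

Restoring in a finally clause means the flag returns to its previous value even when loading inside the block fails.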

Include JSON or TOML

We can include files in formats other than YAML, like JSON or TOML. The custom_loader argument is for that.

📑 Example \ For example:

import json
import tomllib as toml
import yaml
import yaml_include

# Define loader function
def my_loader(urlpath, file, Loader):
    if urlpath.endswith(".json"):
        return json.load(file)
    if urlpath.endswith(".toml"):
        return toml.load(file)
    return yaml.load(file, Loader)

# Create the include constructor, with the custom loader
ctor = yaml_include.Constructor(custom_loader=my_loader)

# Add the constructor to YAML Loader
yaml.add_constructor("!inc", ctor, yaml.Loader)

# Then JSON files will be loaded by the std-lib's json module, and likewise TOML files by tomllib.
s = """
json: !inc "*.json"
toml: !inc "*.toml"
yaml: !inc "*.yaml"
"""

yaml.load(s, yaml.Loader)
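When more formats are added, the chain of if statements in a loader like the one above can be replaced by a lookup table keyed on file extension. The sketch below keeps the same (urlpath, file, Loader) call shape as the example; make_loader and read_text are hypothetical helpers, and the fallback here just reads text so that the sketch runs without PyYAML. In real use the fallback would presumably be yaml.load.

```python
import io
import json
import posixpath


def make_loader(parsers_by_ext, fallback):
    """Build a loader that picks a parser by file extension."""

    def loader(urlpath, file, Loader):
        ext = posixpath.splitext(urlpath)[1].lower()
        parse = parsers_by_ext.get(ext)
        if parse is not None:
            return parse(file)
        # Unknown extension: hand over to the fallback parser.
        return fallback(file, Loader)

    return loader


def read_text(file, Loader):
    """Stand-in fallback for this demo; ignores the Loader argument."""
    return file.read()


my_loader = make_loader({".json": json.load}, fallback=read_text)
parsed = my_loader("conf.d/app.JSON", io.StringIO('{"debug": true}'), None)
raw = my_loader("notes.txt", io.StringIO("plain text"), None)
print(parsed, raw)
```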

Develop

  1. clone the repo:

    git clone https://github.com/tanbro/pyyaml-include.git
    cd pyyaml-include
  2. create then activate a python virtual-env:

    python -m venv .venv
    source .venv/bin/activate
  3. install development requirements and the project itself in editable mode:

    pip install -r requirements.txt

Now you can work on it.

Test

See tests/README.md.