koxudaxi / datamodel-code-generator

Pydantic model and dataclasses.dataclass generator for easy conversion of JSON, OpenAPI, JSON Schema, and YAML data sources.
https://koxudaxi.github.io/datamodel-code-generator/
MIT License

Split output in individual .py files #1170

Open wabiloo opened 1 year ago

wabiloo commented 1 year ago

Describe the solution you'd like
I would like an option to output one .py file per model, instead of having them all in the same file, ideally with an `__init__.py` file that imports them all.

Describe alternatives you've considered
Post-generation processing of the generated file to split it (somehow), but that looks complex.
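For what it's worth, a rough post-generation split is less daunting than it looks if you lean on the standard-library `ast` module. A minimal sketch (the `split_models` helper, file naming, and layout are my own assumptions, not anything datamodel-code-generator provides; cross-references between models and import pruning are deliberately not handled):

```python
import ast
from pathlib import Path


def split_models(module_path: Path, out_dir: Path) -> None:
    """Split a generated models module into one file per top-level class,
    plus an __init__.py re-exporting them all.

    Crude by design: every import from the original file is copied into
    every output file (a linter like ruff/autoflake can prune unused ones),
    and forward references between models are not resolved.
    """
    tree = ast.parse(module_path.read_text())
    imports = [n for n in tree.body if isinstance(n, (ast.Import, ast.ImportFrom))]
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]

    out_dir.mkdir(parents=True, exist_ok=True)
    init_lines = []
    for cls in classes:
        # One module per class: all imports, then the class definition.
        module = ast.Module(body=imports + [cls], type_ignores=[])
        (out_dir / f"{cls.name.lower()}.py").write_text(ast.unparse(module) + "\n")
        init_lines.append(f"from .{cls.name.lower()} import {cls.name}")
    (out_dir / "__init__.py").write_text("\n".join(init_lines) + "\n")
```

Note that `ast.unparse` drops comments and docstring formatting, the same trade-off the gist below makes.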


alexpovel commented 1 year ago

Second this.

For example, the GitHub API response surface is absolutely enormous and very fragmented (repository is one of the core concepts, of course, but different endpoints return slightly different versions of it):

$ python --version
Python 3.11.2
$ datamodel-codegen --version
0.21.4
$ datamodel-codegen --openapi-scope paths --url https://raw.githubusercontent.com/github/rest-api-description/v2.1.0/descriptions/api.github.com/dereferenced/api.github.com.deref.json > api.py
[warnings redacted]
$ wc -l api.py
94887 api.py
$ grep -P 'class Repository\d*\(BaseModel\)' api.py | wc -l
90

So we have Repository all the way through to Repository89. That is sadly too unwieldy. Having this many duplicates or near-duplicates would be fine if they were namespaced.

One workaround I am in the process of exploring is something along the lines of:

jq '.paths["/orgs/{org}"]["get"]["responses"]["200"]["content"]["application/json"]["schema"]' api.github.com.2022-11-28.deref.json > org.json

For this, first download the dereferenced REST OpenAPI description.

Now org.json contains a workable subset. If we mirror the API path structure onto the local file system, replacing variables like {org} with, say, ORG, we can get a workable solution:

$ datamodel-codegen --input org.json --output src/github/api/orgs/ORG/__init__.py

Now, inside the src Python package, one can issue:

from src.github.api.orgs.ORG import OrganizationFull as Organization

So a workaround can look like this (I haven't implemented it yet):

  1. Fetch the full OpenAPI spec from GitHub (41 MB)
  2. For each path (list them with `jq '.paths | keys' api.json`), extract its schema (if no `get` and `200` keys are available, skip it, I guess, if we're only querying)
  3. Save it as `schema.json` under a new path mirroring the API one (which is nice, as it makes it easy to match against the docs and makes it google-able); make it filesystem- and, more importantly, Python-import-friendly (remove `{` etc.)
  4. `find . -name 'schema.json' -print0 | xargs --null -I '~' datamodel-codegen --input ~ --output parent_dir(~)/__init__.py` (`parent_dir` is pseudo-code; I didn't have the patience to figure this out. Why is bash so hard here? File paths are its bread and butter...)

Now you have a hierarchical, importable, structured tree of Python modules, with hopefully as few `ModelName\d+` hits as possible. It can also be fully regenerated at will, and symbols can be renamed via `import X as Y`.
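For what it's worth, the `parent_dir` gap in step 4 can be closed with `dirname` via `find -exec sh -c`, which evaluates the output path per file (plain `xargs` substitution cannot). A sketch, assuming the `schema.json` tree layout described above:

```shell
#!/usr/bin/env sh
# For every schema.json in the mirrored tree, generate models right next
# to it. The inner sh -c receives each path as $1, so dirname can compute
# the parent directory ("parent_dir" from step 4) per file.
find . -name 'schema.json' -exec sh -c '
    datamodel-codegen --input "$1" --output "$(dirname "$1")/__init__.py"
' _ {} \;
```

The `_` placeholder fills `$0` of the inner shell so the found path lands in `$1`.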

alexpovel commented 1 year ago

I implemented the above idea. It took more lines than expected, so it lives in a gist:

https://gist.github.com/alexpovel/00ab28e4815a905d4e0407c4932f9988

The module/script docstring explains everything, and there's also a usage example shell script. The script is specific to the GitHub OpenAPI specification but can, I hope, be adjusted easily. It is a hot hack on top of datamodel-codegen and will hopefully one day be obsolete (it doesn't only generate a single-path model file, but also fixes a few bugs I happened to come across; your use case might differ and require other fixes, if any).

Pasting both below for convenience.

Main

#!/usr/bin/env python3

# https://peps.python.org/pep-0722/
# Script Dependencies:
#
#    datamodel-code-generator==0.21.1 ; python_version >= "3.10" and python_version < "4.0"
#    black==23.7.0 ; python_version >= "3.10" and python_version < "4.0"
#    pydantic==2.1.0 ; python_version >= "3.10" and python_version < "4.0"

"""From an OpenAPI specification in JSON format, generate a pydantic data model for a
given URL path inside that spec (only).

For example, for a spec containing the `"paths"` key
`/repos/{owner}/{repo}/actions/runs`, it will place the generated pydantic models at
`$OUT_DIR/repos/OWNER/REPO/actions/runs/__init__.py`, ready to be imported. The output
root can be modified via `--out-dir`.

See https://github.com/koxudaxi/datamodel-code-generator/issues/1170 for why this script
can be useful (`datamodel-codegen` generates a *single* file out of a given spec, which
can get unwieldy for large specs).

This script contains special-cased code specific to the GitHub API that you might want
to delete.

Currently only works for GET queries returning HTTP 200 responses on success. Other HTTP
methods and status codes are not supported, but easy to add.
"""

import argparse
import ast
import json
import logging
import re
import typing as t
from functools import partial
from http import HTTPMethod, HTTPStatus
from pathlib import Path, PosixPath

from datamodel_code_generator import (
    DataModelType,
    InputFileType,
    PythonVersion,
    generate,
)

logging.basicConfig(level=logging.INFO)

def sanitize_for_python_import_use(path: PosixPath) -> Path:
    """Sanitize a path for use as a Python import.

    Abusing file system paths here, assuming that any passed URL path is compatible. All
    we want is type-safe splitting at `/` anyway.

    Imported name parts *must* be valid Python identifiers, which this function affords.
    See `dotted_name` of https://docs.python.org/3/reference/grammar.html .
    """

    logging.info(f"Sanitizing path '{path}'")

    assert path.is_absolute(), (
        f"Path '{path}' not absolute, as required by OpenAPI spec"
        + " (https://github.com/OAI/OpenAPI-Specification/blob/9dff244e5708fbe16e768738f4f17cf3fddf4066/schemas/v3.0/schema.json#L793)"
    )

    if str(path) == path.root:  # Terminating base case
        return path

    name = path.name

    def replace(match: re.Match[str]) -> str:
        """Replace a match with its first group, uppercased.

        Useful to not uppercase the *entire* string.
        """
        return match.group(1).upper()

    # Deal with `{foo}` -> `FOO`, common for variable paths in OpenAPI specs
    name = re.sub(r"\{(.*)\}", replace, name)

    # Any remaining non-word characters -> `_`
    name = re.sub(r"[^\w]", "_", name)

    # Leading $DIGITS -> `_$DIGITS`
    name = re.sub(r"^(\d)", r"_\1", name)

    logging.info(f"Sanitized path name to: '{name}'")

    assert name.isidentifier(), f"Name '{name}' not a valid identifier after cleaning"

    return sanitize_for_python_import_use(path.parent) / name

def general_fixup(tree: ast.Module, *, class_renames: dict[str, str]) -> ast.Module:
    """Fixes up code generated by `datamodel-codegen`.

    This should only be necessary for a brief period of time, until the below is fixed
    upstream.

    This uses `ast` which drops comments, whitespace and other formatting. Use LibCST
    (https://libcst.readthedocs.io/en/latest/) for concrete syntax trees able to
    preserve these elements. `ast` was used for simplicity.
    """
    import ast

    from pydantic import Field

    def convert_single_example_value_to_examples_list(tree: ast.Module) -> ast.Module:
        """Converts `example` values to `examples` lists.

        Despite specifying `DataModelType.PydanticV2BaseModel`, `datamodel-codegen` will
        generate calls like `Field(example="hello")`, which in pydantic v1 was allowed
        (as `Any` `kwarg` was):

        https://github.com/pydantic/pydantic/blob/v1.10.12/pydantic/fields.py#L249

        but in pydantic v2 is fixed:

        https://github.com/pydantic/pydantic/blob/v2.1.1/pydantic/fields.py#L672

        aka calls should now look like `Field(examples=["hello"])`.
        """
        for node in ast.walk(tree):
            match node:
                case ast.AnnAssign(
                    annotation=ast.Subscript(
                        slice=ast.Tuple(
                            dims=[
                                _,
                                ast.Call(
                                    func=ast.Name(id=Field.__name__), keywords=keywords
                                ),
                            ]
                        )
                    )
                ):
                    for kw in keywords:
                        match kw:
                            case ast.keyword(arg="example", value=example):
                                kw.arg = "examples"
                                kw.value = ast.List(elts=[example], ctx=ast.Load())
                            case _:
                                pass
                case _:
                    pass

        return tree

    def rename_classes(tree: ast.Module, renames: dict[str, str]) -> ast.Module:
        """Renames classes in the AST.

        `datamodel-codegen` generates class names automatically from the OpenAPI
        description, where they might not be desirable names (`OrganizationFull` instead
        of just `Organization`).
        """

        keys = set(renames.keys())

        for node in ast.walk(tree):
            match node:
                # Class definition itself
                case ast.ClassDef(name=name) if name in renames:
                    node.name = renames[name]
                    keys.remove(name)

                # Class references
                case ast.Name(id=name) if name in renames:
                    # Relevant for cases like:
                    #
                    # ```python
                    # import ast
                    #
                    # print(ast.dump(ast.parse('def x(a: SomeClass): pass'), indent=4))
                    # print(ast.dump(ast.parse('x: list[SomeClass] = [3]'), indent=4))
                    # print(ast.dump(ast.parse('SomeClass(x=3)'), indent=4))
                    # ```
                    node.id = renames[name]
                case _:
                    pass

        if keys:
            logging.error(f"Class renames not applied (name not found) for: {keys}")

        return tree

    tree = convert_single_example_value_to_examples_list(tree)
    tree = rename_classes(tree, class_renames)

    return tree

def github_fixup(tree: ast.Module, *, path: PosixPath) -> ast.Module:
    """GitHub REST API-specific fixes.

    Some ugly, hard-coded special cases. Mainly to iron out what are probably bugs in
    `datamodel-codegen`'s OpenAPI parsing, so this should be temporary.

    For example, a JSON schema entry:

    ```json
    "archived_at": {
        "type": "string", "format": "date-time", "nullable": true
    }
    ```

    ending up as `archived_at: datetime` (no `None`), leading to validation errors.
    """

    from datetime import datetime

    from pydantic import EmailStr

    applied_fixes = {
        "archived_at_datetime_not_nullable": False,
        "description_string_not_nullable": False,
        "emailstr_to_str": False,
    }

    for node in ast.walk(tree):
        match node:
            # `path` would be much better as part of the (in that case `tuple`) pattern
            # natively instead of guards but mypy failed then, and type narrowing broke
            # down.

            case ast.AnnAssign(
                target=ast.Name(id="archived_at"),
                annotation=ast.Name(id=datetime.__name__) as annotation,
            ) if path == PosixPath(r"/orgs/{org}"):
                logging.info("Fixing up `archived_at` for node: %s", ast.dump(node))

                node.annotation = ast.BinOp(
                    left=annotation,
                    op=ast.BitOr(),
                    right=ast.Constant(value=None),
                )

                applied_fixes["archived_at_datetime_not_nullable"] = True
            case ast.AnnAssign(
                target=ast.Name(id="description"),
                annotation=ast.Subscript(
                    slice=ast.Tuple(elts=[ast.Name(id=str.__name__) as first, *rest]),
                ),
            ):
                # ANY PATH!
                logging.info("Fixing up `description` for node: %s", ast.dump(node))

                # Type-narrow manually, else we're not allowed to reach through the
                # attributes beyond `node.annotation`. `mypy` should be strong enough to
                # do this itself at a future date.
                assert isinstance(node.annotation, ast.Subscript)
                assert isinstance(node.annotation.slice, ast.Tuple)

                # Additionally allowing `None` for the `description` field
                # *unconditionally* is not fatal, as it's simply a more conservative
                # choice, requiring some `None` checks even if the GitHub API would
                # actually never return `None` for that field.
                node.annotation.slice.elts = [
                    ast.BinOp(
                        left=first,
                        op=ast.BitOr(),
                        right=ast.Constant(value=None),
                    ),
                    *rest,
                ]

                applied_fixes["description_string_not_nullable"] = True

            case ast.Name(id=EmailStr.__name__):  # ANY PATH!
                # `pydantic.EmailStr` uses `email-validator`, which (rightfully?)
                # doesn't allow square brackets:
                #
                # https://github.com/JoshData/python-email-validator/blob/5abaa7b4ce6677e5a2217db2e52202a760de3c24/email_validator/rfc_constants.py#L7
                #
                # Let's change *all* these occurrences to `str` for now, as the exact
                # email format isn't that important.
                #
                # Breakage was noticed due to GitHub dependabot commits, where the
                # author email can be, for example:
                #
                # ```text
                # `49699333+dependabot[bot]@users.noreply.github.com`
                # ```
                #
                # Something something https://news.ycombinator.com/item?id=32671959
                node.id = str.__name__

                applied_fixes["emailstr_to_str"] = True

            case _:
                pass

    for key, applied in applied_fixes.items():
        if not applied:
            logging.warning(f"Fix `{key}` not applied for path: {path}")

    return tree

def black(code: str) -> str:
    """Format code with black.

    `black` doesn't have an API (yet) so this is brittle! See
    https://stackoverflow.com/a/76052629/11477374
    """
    import black

    BLACK_MODE = black.Mode(  # type: ignore[attr-defined]
        target_versions={black.TargetVersion.PY311},  # type: ignore[attr-defined]
        preview=True,  # Get experimental features like string formatting/wrapping
    )

    try:
        code = black.format_file_contents(code, fast=False, mode=BLACK_MODE)
    except black.NothingChanged:  # type: ignore[attr-defined]
        pass
    finally:
        if code and code[-1] != "\n":
            code += "\n"

    return code

def embellish(code: str, original_spec: dict[t.Any, t.Any]) -> str:
    """Embellish code with comments and docstrings."""

    import sys
    from datetime import datetime
    from textwrap import dedent

    tool_directives = [
        # Add any required, file-scoped linter directives in here.
        #
        # Ignore line length:
        "ruff: noqa: E501",
        # Ignore `pydantic.RootModel` w/o generic args, which is occasionally generated
        # by `datamodel-codegen`:
        'mypy: disable-error-code="type-arg"',
    ]

    header = dedent(
        f"""\
        # File generated by command: {' '.join(sys.argv)}
        #
        # Generated at: {datetime.utcnow().isoformat()}
        #
        # Do not edit manually.
        #
        # Original schema this was generated from attached at bottom of file.
        """
    )

    header = header + "\n" + "\n".join(f"# {line}" for line in tool_directives) + "\n"

    footer = "\n".join(
        f"# {line}" for line in json.dumps(original_spec, indent=2).split("\n")
    )

    return header + "\n" + code + "\n" + footer

def main() -> None:
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description=__doc__,
    )

    parser.add_argument(
        "spec_file",
        type=Path,
        help="Path to file containing OpenAPI spec."
        + " See for example https://github.com/github/rest-api-description",
    )
    parser.add_argument(
        "path",
        type=str,
        help=r"URL part from OpenAPI spec to generate code for, e.g. '/orgs/{org}'."
        + " See 'Path Objects' of https://swagger.io/specification/ for details.",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        help="Base directory under which to place generated code."
        + " An OpenAPI path like `/some/path` will be placed at `OUT_DIR/some/path`.",
        default=Path("api"),
    )
    parser.add_argument(
        "--rename-class",
        help="Rename a class in the generated code. Can be specified multiple times."
        + " Specify as `OldName=NewName`."
        + " Example: `--rename-class Model=SomeMeaningfulName`",
        action="append",
    )
    parser.add_argument(
        "-f",
        "--force",
        action="store_true",
        help="Overwrite existing files",
    )

    args = parser.parse_args()

    spec_file = Path(args.spec_file)
    path = PosixPath(args.path)
    out_dir = Path(args.out_dir)
    class_renames = (
        {}
        if args.rename_class is None
        else {
            old: new
            for old, new in map(
                partial(str.split, sep="="),
                args.rename_class,
            )
        }
    )

    for old, new in class_renames.items():
        assert old.isidentifier(), f"Invalid class name: {old}"
        assert new.isidentifier(), f"Invalid class name: {new}"

    force = args.force is True  # Don't rely on truthiness

    spec = json.loads(spec_file.read_text())

    method = str(HTTPMethod.GET).lower()
    status = str(HTTPStatus.OK)
    # The following will produce helpful and beautiful enough error messages by itself
    # from Python 3.11 on
    # (https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep657), so just let it
    # fail without special key checks.
    #
    # If the hardcoded `GET`/`200` combo is ever refactored to be more flexible, maybe
    # `jq` like query syntax works best ("industry standard" and most flexible for
    # users).
    schema = spec["paths"][str(path)][method]["responses"][status]["content"][
        "application/json"
    ]["schema"]
    output = sanitize_for_python_import_use(path) / "__init__.py"

    assert output.is_absolute()
    output = output.relative_to(output.root)  # 'strip' leading slash
    output = out_dir / output

    if output.exists() and not force:
        raise FileExistsError(
            f"Output '{output}' already exists, refusing to overwrite."
            + " Use --force to overwrite."
        )

    output.parent.mkdir(parents=True, exist_ok=True)

    generate(
        input_=str(schema),
        input_file_type=InputFileType.JsonSchema,
        output=output,
        output_model_type=DataModelType.PydanticV2BaseModel,
        field_constraints=True,
        use_field_description=False,  # Comments dropped in AST parsing
        use_annotated=True,
        reuse_model=True,
        target_python_version=PythonVersion.PY_311,
        use_double_quotes=True,
        use_standard_collections=True,
        use_union_operator=True,
        wrap_string_literal=True,
    )

    # Syntax-level fixups
    tree = ast.parse(output.read_text())
    tree = general_fixup(tree, class_renames=class_renames)
    tree = github_fixup(tree, path=path)
    code = ast.unparse(tree)

    # Raw string code-level fixups
    code = black(code)
    code = embellish(code, schema)

    with open(output, "w") as f:
        f.write(code)

if __name__ == "__main__":
    main()


## Usage example

```bash
#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

VENV_DIR=$(mktemp -d)

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

echo "Using temporary virtual environment at $VENV_DIR"

setup() {
    command -v curl || { echo "Need to install curl..." && sudo apt update && sudo apt install --yes curl; }

    # Somehow install Python dependencies from `datamodel-codegen-path.py`, which uses
    # PEP 722 which isn't supported anywhere yet... so do it manually for demonstration
    # 🤷

    python3 -m venv "$1"

    "$1"/bin/python3 -m pip install datamodel-code-generator pydantic black
}

setup "$VENV_DIR"

[ -f github.json ] || \
    curl \
        --location \
        --output github.json \
        https://raw.githubusercontent.com/github/rest-api-description/v2.1.0/descriptions/api.github.com/dereferenced/api.github.com.deref.json

"$VENV_DIR"/bin/python3 \
    "${SCRIPT_DIR}/datamodel-codegen-path.py" \
    --force \
    --rename-class 'Model=OrganizationRepository' \
    github.json \
    '/orgs/{org}/repos'

CODE=$(cat <<EOF
from api.orgs.ORG.repos import OrganizationRepository

print(OrganizationRepository)
print("Import successful, your setup worked!")
EOF
)

"$VENV_DIR"/bin/python3 -c "$CODE"
```

89465127 commented 10 months ago

I am facing this issue as well.

One thing I found interesting: some months (maybe a year) ago, I used datamodel-codegen and got a hierarchy of packages and modules. That run took a JSON file with `"openapi": "3.0.0"` as its input.

Now when I use it, I get one large file with many duplicate model names, disambiguated by integer suffixes. This run took a JSON file with `"swagger": "2.0"` as its input.

I am not sure if this is due to a change in datamodel-codegen or the difference in input files.

@koxudaxi how does datamodel-codegen decide whether it produces one large file or a structure of packages & modules?

silviogutierrez commented 6 months ago

@89465127 from my own testing, if schemas have periods/dots in the name foo.bar.Snap, then it requires a folder output and uses separate files. If I replace the periods with underscores or use camelcase, it uses a single output file.