Separate modified models from generated models

As of e06b6cd8933807348fd933c8d85b47fac547c114, auto-generated Harbor API models are located in harborapi.models.models and harborapi.models.scanner. In these modules, we have made manual modifications to the model definitions to fix errors in the spec and/or add new functionality to the models. This has a major drawback; updating the models from updated spec definitions is a cumbersome and manual process, as the modifications are overwritten by the new auto-generated definitions.

Changing the models directly makes it very difficult to generate new models definitions whenever the Harbor API spec and the Scanner API spec are updated. This is because any new definitions will overwrite the changes we have made, and thus every update would have to be manually incorporated into the models.models and models.scanner modules.

Changes

To remedy the problem outlined above, this pull request puts the auto generated models into their own modules (harborapi.models._models and harborapi.models._scanner), and modifies them in harborapi.models.models and harborapi.models.scanner respectively. These modules re-export the models from _models and _scanner, thus maintaining backwards compatibility by keeping the API for users unchanged.

Auto generating new models going forward

Auto-generating models only requires running just genapi and just genscanner. Since the definitions and modifications/overrides are stored separately, generating new models should have no effect on the overrides - unless the models are changed by the update in a way that is not compatible with the current overrides. Re-run tests each time models are generated to verify that the new definitions are compatible with the application, and makes changes accordingly.

Tests

New tests have been added to ensure that the modified models are compatible with the generated models. The tests check the modified fields for deviations between the generated models and the modified models. For now, these tests are quite specific and strict, wherein the modified fields are manually specified. This is potentially not optimal, and we could look into adding some sort of _overrides field to modified models, so we can automate the testing process:

https://github.com/pederhan/harborapi/blob/f6bfa6f981c2d6f410d379d49c7e8d76f813ec29/tests/models/test_scanner.py#L26-L33

We manually specify the modified fields in the tests, and which field attributes we expect to diverge (if any). Default field value is not compared, since that is the most likely override.

In pull request

class VulnerabilityItem(_VulnerabilityItem):
    severity: Severity = override_field(
        _VulnerabilityItem,
        "severity",
        default=Severity.unknown,
        description="The severity of the vulnerability.",
        example=Severity.high.value,
    ) 
    links: Optional[List[str]] = optional_field(_VulnerabilityItem, "links")
# ...

In the future

class VulnerabilityItem(_VulnerabilityItem):
    _overrides = {
        "severity": {
            "default": Severity.unknown,
            "description": "The severity of the vulnerability.",
            "example": Severity.high.value,
        },
        "links": {"type": Optional[List[str]]},
    }
    # ...

The testing regime will be outlined in a future pull request adding a development guide.

Somewhat big problem: model overrides don't affect generated models that reference them. The generated models will still reference the original (generated) model, not the override. In order to fix this, we override the models that reference the model we have overridden.

https://github.com/pederhan/harborapi/blob/cb24b4349d4aaf8a1d081b01b6a3a71da05913bb/harborapi/models/models.py#L346-L371

This is not ideal, and introduces a lot of room for human error. Certain tests are added to ensure models behave as expected, but tests are also written manually, and they, too, can miss important changes.

Good documentation NOW

All these overrides really make it necessary to document these models clearly in the documentation. We can't expect users to jump through the whole class hierarchy to get a decent idea of how the model is structured.

To that end, we need to find out how we can use mkdocstrings or something similar to generate documentation from Pydantic models, because as it stands, it's very messy.

A new problem has come to light: datamodel-code-generator has introduced a breaking change some time between version 0.13.0 and 0.17.0.

Models such as ExtraAttrs are generated as models with a single field __root__: T instead of an empty model with extra = "allow". This is technically a more correct approach, as it attempts to follow the spec more closely, but is actually a huge problem in our case.

Specifically ExtraAttrs, and probably others too, is specified as following in swagger.yaml:

ExtraAttrs:
  type: object
  additionalProperties:
    type: object

Which in datamodel-code-generator 0.13.0 generates the following model:

class ExtraAttrs(BaseModel):
    pass

    class Config:
        extra = Extra.allow

But in 0.17.0 generates this model:

class ExtraAttrs:
    __root__: Optional[Dict[str, Dict[str, Any]]] = None

The latter (0.17.0) is technically more correct when you take into consideration the spec, but the only problem is: THE SPEC IS WRONG!

Here is an example of what the API returns from GET /projects/library/repositories/ubuntu/artifacts/latest:

{
    "id": 2,
    "type": "IMAGE",
    "media_type": "application/vnd.docker.container.image.v1+json",
    "manifest_media_type": "application/vnd.docker.distribution.manifest.v2+json",
    "project_id": 1,
    "repository_id": 2,
    "digest": "sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea",
    "size": 30430700,
    "icon": "sha256:0048162a053eef4d4ce3fe7518615bef084403614f8bca43b40ae2e762e11e06",
    "push_time": "2023-02-05T12:20:35.997000+00:00",
    "pull_time": "2023-02-06T10:50:38.829000+00:00",
    "extra_attrs": {
        "config": {
            "Cmd": [
                "bash"
            ],
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ]
        },
        "created": "2022-12-09T01:20:31.321639501Z",
        "os": "linux",
        "author": "",
        "architecture": "amd64"
    },
    "annotations": null,
    "references": null,
    "tags": [
        {
            "id": 2,
            "repository_id": 2,
            "artifact_id": 2,
            "name": "latest",
            "push_time": "2023-02-05T12:20:36.013000+00:00",
            "pull_time": "2023-02-05T17:38:27.844000+00:00",
            "immutable": false,
            "signed": false
        }
    ],
    "addition_links": {
        "build_history": {
            "absolute": false,
            "href": "/api/v2.0/projects/library/repositories/ubuntu/artifacts/sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea/additions/build_history"
        },
        "vulnerabilities": {
            "absolute": false,
            "href": "/api/v2.0/projects/library/repositories/ubuntu/artifacts/sha256:965fbcae990b0467ed5657caceaec165018ef44a4d2d46c7cdea80a9dff0d1ea/additions/vulnerabilities"
        }
    },
    "labels": null,
    "scan_overview": null,
    "accessories": null
}

Notice how extra_attrs is NOT a dict of dicts, but rather a dict of Any. We observe the values to be both dicts and strings. As such, the definition generated by 0.13.0 is more correct, as it doesn't make incorrect assumptions about the shape of the data. I need to read up on OpenAPI/Swagger to understand if this is a problem with the spec or with datamodel-code-generator, but either way, for now we will have to continue using 0.13.0

Another thing to note about __root__ models is that they are going away in Pydantic V2, so we should steer clear of these as much as we can. Hopefully datamodel-code-generator introduces something to avoid generating root models altogether in the future.

unioslo / harborapi