koxudaxi / datamodel-code-generator

Pydantic model and dataclasses.dataclass generator for easy conversion of JSON, OpenAPI, JSON Schema, and YAML data sources.
https://koxudaxi.github.io/datamodel-code-generator/
MIT License

Infinite loop in generating models from JSONSchema #986

Open spyoungtech opened 1 year ago

spyoungtech commented 1 year ago

Describe the bug

I am trying to produce pydantic models from a JSONSchema file I have. When I try to do this, the process never finishes and just accumulates memory without end. I let it run for a while and it ended up taking up 8+GB of memory. The schema itself is a good handful of megabytes with probably over 10,000 discrete components, which could be a problem, but I believe it should stop eventually.

Eventually this stack trace is produced with a RecursionError:

```
Traceback (most recent call last):
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\__main__.py", line 626, in main
    generate(
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\__init__.py", line 384, in generate
    results = parser.parse()
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\base.py", line 475, in parse
    self.parse_raw()
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1270, in parse_raw
    self._resolve_unparsed_json_pointer()
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1294, in _resolve_unparsed_json_pointer
    self._resolve_unparsed_json_pointer()
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1294, in _resolve_unparsed_json_pointer
    self._resolve_unparsed_json_pointer()
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1294, in _resolve_unparsed_json_pointer
    self._resolve_unparsed_json_pointer()
  [Previous line repeated 957 more times]
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1290, in _resolve_unparsed_json_pointer
    self.parse_json_pointer(self.raw_obj, reserved_ref, path_parts)
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1306, in parse_json_pointer
    self.parse_raw_obj(
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 1213, in parse_raw_obj
    self.parse_obj(name, JsonSchemaObject.parse_obj(raw), path)
  File "pydantic\main.py", line 526, in pydantic.main.BaseModel.parse_obj
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 208, in __init__
    super().__init__(**data)
  File "pydantic\main.py", line 340, in pydantic.main.BaseModel.__init__
  File "pydantic\main.py", line 1076, in pydantic.main.validate_model
  File "pydantic\fields.py", line 886, in pydantic.fields.ModelField.validate
  File "pydantic\fields.py", line 1021, in pydantic.fields.ModelField._validate_mapping_like
  File "pydantic\fields.py", line 1094, in pydantic.fields.ModelField._validate_singleton
  File "pydantic\fields.py", line 884, in pydantic.fields.ModelField.validate
  File "pydantic\fields.py", line 1094, in pydantic.fields.ModelField._validate_singleton
  File "pydantic\fields.py", line 884, in pydantic.fields.ModelField.validate
  File "pydantic\fields.py", line 1101, in pydantic.fields.ModelField._validate_singleton
  File "pydantic\fields.py", line 1148, in pydantic.fields.ModelField._apply_validators
  File "pydantic\class_validators.py", line 318, in pydantic.class_validators._generic_validator_basic.lambda13
  File "pydantic\main.py", line 711, in pydantic.main.BaseModel.validate
  File "C:\Users\Spencer\repos\redacted\venv\lib\site-packages\datamodel_code_generator\parser\jsonschema.py", line 208, in __init__
    super().__init__(**data)
  File "pydantic\main.py", line 340, in pydantic.main.BaseModel.__init__
  File "pydantic\main.py", line 1076, in pydantic.main.validate_model
  File "pydantic\fields.py", line 884, in pydantic.fields.ModelField.validate
  File "pydantic\fields.py", line 1101, in pydantic.fields.ModelField._validate_singleton
  File "pydantic\fields.py", line 1148, in pydantic.fields.ModelField._apply_validators
  File "pydantic\class_validators.py", line 318, in pydantic.class_validators._generic_validator_basic.lambda13
  File "pydantic\validators.py", line 61, in pydantic.validators.str_validator
RecursionError: maximum recursion depth exceeded while calling a Python object
```

To Reproduce

The schema is too large to put in the issue, but it can be found in this gist.

Used commandline:

$ datamodel-codegen  --input problem.json --input-file-type jsonschema --output model.py

Expected behavior

The expectation is that the model generation eventually completes.

Version:

Additional context

The schema itself was created by dynamically generating pydantic models and dumping model.json_schema(). Not sure if that's relevant, but in my mind I guess it's not out of the realm of possibility that this could matter.

lhmwtum commented 1 year ago

Hey @spyoungtech, @koxudaxi

I'm also experiencing this issue: generation gets stuck in an infinite loop when building a data model from a JSON schema (it hangs inside datamodel_code_generator.generate()).

System:
ubuntu: 22.04
python: 3.10
datamodel-code-generator: v0.17.2

I've tried to parse the OpenLabel schema json downloaded here.

The module works as expected up to v0.17.1. The issue seems to have been introduced in v0.17.2. Note that this differs from the comment above, where the problematic version already seems to be v0.14.1.

koxudaxi commented 10 months ago

@lhmwtum v0.17.2 is too old. Can you try the latest version 0.25.1?

benedikt-bartscher commented 9 months ago

Hey @koxudaxi, thanks for this awesome generator! I just tried to generate models for the authentik blueprints schema and hit the same RecursionError using the latest version, 0.25.1, of datamodel-codegen.

benedikt-bartscher commented 8 months ago

If I remove the line

    "$id": "https://goauthentik.io/blueprints/schema.json",

from the schema, the recursion error is gone.

lhmwtum commented 6 months ago

Deleting the "$id" field also "solves" the problem with my JSON schema.
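The workaround mentioned in these comments (dropping the top-level "$id" before generation) can be sketched as follows. This is only a minimal illustration: strip_schema_id is a hypothetical helper name, not part of datamodel-code-generator, and it assumes the schema's top level is a JSON object. Removing "$id" may also change how relative $ref values resolve in schemas that rely on it.

```python
import json


def strip_schema_id(schema_text: str) -> str:
    """Return the schema as JSON text with the top-level "$id" removed."""
    schema = json.loads(schema_text)
    # pop() with a default does nothing when "$id" is absent.
    schema.pop("$id", None)
    return json.dumps(schema)


example = '{"$id": "https://dummy/test/path", "type": "object"}'
print(strip_schema_id(example))
```

The returned string could then be passed to the generator in place of the original schema text.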

I've tried to figure out what's wrong, and it seems that the condition of this if statement always evaluates to True. As a result, the same function is called over and over again, which produces the infinite loop described above. I've also watched the size of self.results, and it grows without bound.

@koxudaxi do you have any idea why this happens? Could you please provide some details or hints in order to understand what's happening here and why this check is necessary? Thank you :)

https://github.com/koxudaxi/datamodel-code-generator/blob/main/datamodel_code_generator/parser/jsonschema.py#L1728

EDIT: I also found out that the reserved_refs variable looks different with/without the id. The id is put at the beginning of each ref. When running the code without the id, reference.loaded == True, with id this variable is False. Furthermore, there is a different number of reserved_refs in both cases.
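The symptom described here (a check that never becomes false, so the same references are processed again and again) can be illustrated with a toy model. The sketch below is not the library's code: resolve_all, its arguments, and the key-mismatch scenario are assumptions made purely for illustration. It shows how a resolver that records a reference under a different key than the one it checks never reaches a fixed point, and how a step budget turns the hang into a clear error.

```python
def resolve_all(pending, normalize, max_steps=1000):
    """Toy resolver: process refs until none is left unloaded.

    `normalize` maps a ref to the key under which it is stored once
    loaded. If that key differs from the ref being checked, the ref
    never looks loaded and the loop would not terminate on its own.
    """
    loaded = set()
    steps = 0
    while True:
        unresolved = [ref for ref in pending if ref not in loaded]
        if not unresolved:
            return steps
        for ref in unresolved:
            steps += 1
            if steps > max_steps:
                raise RuntimeError(f"no progress after {max_steps} steps: {ref}")
            loaded.add(normalize(ref))


refs = ["https://dummy/test/path#/definitions/A"]

# Consistent keys: the loop reaches a fixed point after one step.
print(resolve_all(refs, normalize=lambda ref: ref))

# Mismatched keys (hypothetical): the ref is stored under a shortened
# key, so the membership check keeps failing until the budget trips.
try:
    resolve_all(refs, normalize=lambda ref: ref.rsplit("/", 1)[-1])
except RuntimeError as exc:
    print("detected:", exc)
```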


lhmwtum commented 6 months ago

I've figured out that the path composition of the id field makes the difference. When there are more than two parts, the code ends up in an infinite loop.

Examples

These paths make the code fail:
"$id": "https://openlabel.asam.net/V1-0-0/schema#"
"$id": "https://goauthentik.io/blueprints/schema.json"
"$id": "https://dummy/test/path"
"$id": "https://dummy/test/test2/path"

These paths work:
"$id": "https://openlabel.asam.net/schema#"
"$id": "https://goauthentik.io/schema.json"
"$id": "https://dummy/path"
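
Based only on the examples reported in this comment, the pattern seems to be that an "$id" whose URL path has more than one segment (that is, more than two parts when the host is counted) triggers the loop. Here is a minimal sketch for checking an "$id" against that observation. id_path_segments and looks_affected are hypothetical helper names, and the rule is an empirical guess from this thread, not documented behavior of the library.

```python
from urllib.parse import urlsplit


def id_path_segments(schema_id: str) -> list[str]:
    """Split the path of an "$id" URL into its non-empty segments."""
    return [part for part in urlsplit(schema_id).path.split("/") if part]


def looks_affected(schema_id: str) -> bool:
    """Heuristic from this thread: more than one path segment fails."""
    return len(id_path_segments(schema_id)) > 1


failing = [
    "https://openlabel.asam.net/V1-0-0/schema#",
    "https://goauthentik.io/blueprints/schema.json",
    "https://dummy/test/path",
    "https://dummy/test/test2/path",
]
working = [
    "https://openlabel.asam.net/schema#",
    "https://goauthentik.io/schema.json",
    "https://dummy/path",
]

for schema_id in failing + working:
    print(looks_affected(schema_id), schema_id)
```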

Can you confirm that, @benedikt-bartscher?

benedikt-bartscher commented 6 months ago

@lhmwtum thanks for investigating. I just tested the paths you provided and I can confirm the behavior.

koxudaxi commented 6 months ago

Sorry for the late reply everyone. And thank you for your detailed research. From what I have read of your findings, it seems that the Path resolution is not working.

> I've figured out that the path composition of the id field makes the difference. When there are more than two parts, the code ends up in an infinite loop.

Thank you. We will try to add test cases with deeper hierarchical paths (three or more parts).

koxudaxi commented 6 months ago

~~I guess the unittest is broken. The request url doesn't have the definitions prefix. https://github.com/koxudaxi/datamodel-code-generator/blob/b3fbbcade9814d4080098ae61ba69e6f8dd018f5/tests/test_main.py#L3090-L3153~~

koxudaxi commented 6 months ago

@benedikt-bartscher @lhmwtum How do you reproduce the error?

https://github.com/koxudaxi/datamodel-code-generator/issues/986#issuecomment-1878607726

I took the schema from this post, either via --url or by saving it with wget and running it with --input, and the file was created without error. However, when I ran it as cat blueprints.json | datamodel-codegen, I got stuck in an infinite loop. :thinking:
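
Since the failure mode is a hang rather than an immediate error, it can help to run each invocation variant under a timeout when comparing them. This is a minimal sketch: finishes_in_time is a hypothetical helper, the 60-second default is arbitrary, and the self-check uses stand-in Python commands rather than the real CLI.

```python
import subprocess
import sys


def finishes_in_time(cmd, stdin_text=None, timeout=60.0):
    """Run `cmd`, optionally feeding `stdin_text`; return False on timeout."""
    try:
        subprocess.run(
            cmd,
            input=stdin_text,
            text=True,
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


# Self-check with stand-in commands instead of the real CLI.
print(finishes_in_time([sys.executable, "-c", "pass"], timeout=30))
print(finishes_in_time([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5))
```

To compare the two variants from this comment, one call could pass ["datamodel-codegen", "--input", "blueprints.json"] and the other ["datamodel-codegen"] with stdin_text set to the file contents. Note that True only means the process ended in time, not that it succeeded.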

benedikt-bartscher commented 6 months ago

Hi @koxudaxi I am currently using a small python script, like this:

import json
import logging
from pathlib import Path

import requests
from datamodel_code_generator import DataModelType, InputFileType, generate
from datamodel_code_generator.format import PythonVersion

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# renovate: datasource=github-tags depName=goauthentik/authentik extractVersion=version/(?<version>.*)
authentik_version = "2024.2.2"

url = f"https://raw.githubusercontent.com/goauthentik/authentik/version/{authentik_version}/blueprints/schema.json"

logger.info(f"Fetching schema from {url}")
schema = requests.get(url).text
logger.info("Fetching schema done")

#  logger.info("Loading schema from file")
#  with open("schema.json", "r") as f:
#      schema = f.read()

# load schema in dict
schemadict = json.loads(schema)
del schemadict["$id"]

#  # write schema to file
#  logger.info("Writing modified schema to file")
#  with open("schema.json", "w") as f:
#      f.write(json.dumps(schemadict, indent=4))

logger.info("Loading modified schema back to string")
schema = json.dumps(schemadict)

aliases = {
    "resource": "resource_",
}

outpath = Path("src/authentik_blueprints")
logger.info(f"Creating output directory {outpath}")
outpath.mkdir(parents=True, exist_ok=True)

logger.info("Start generating models")
generate(
    schema,
    aliases=aliases,
    #  reuse_model=True,
    input_file_type=InputFileType.JsonSchema,
    target_python_version=PythonVersion.PY_312,
    output=outpath,
    output_model_type=DataModelType.PydanticV2BaseModel,
    use_default_kwarg=True,
    # modern python
    use_union_operator=True,
    use_standard_collections=True,
    use_generic_container_types=True,
    use_annotated=True,
    field_constraints=True,
)

lhmwtum commented 6 months ago

I have the schema stored locally as a json file. This is my code:

import json
from pathlib import Path
from datamodel_code_generator import InputFileType, generate

filename_openlabel_json_schema = "openlabel_json_schema_v1-0-0.json"

# get absolute path to repository
abspath_repo = Path(__file__).parent
abspath_json_schema = abspath_repo / filename_openlabel_json_schema
abspath_output = abspath_repo / "openlabel_annotation_schema.py"

# Load OpenLABEL JSON schema file
with open(abspath_json_schema) as fp:
    json_schema = json.load(fp)

generate(
    # Serialize back to JSON text; str() on a dict yields a Python repr, not JSON.
    json.dumps(json_schema),
    input_filename=filename_openlabel_json_schema,
    input_file_type=InputFileType.JsonSchema,
    reuse_model=True,
    output=abspath_output,
    # NOTE: set to False to suppress auto-generated doc strings which do not
    # meet pep257 standards.
    use_schema_description=False,
    class_name="OpenLabelAnnotationSchema",
)