python-jsonschema / hypothesis-jsonschema

Tools to generate test data from JSON schemata with Hypothesis
https://pypi.org/project/hypothesis-jsonschema/
Mozilla Public License 2.0
252 stars 29 forks source link

Support recursive references #33

Open Zac-HD opened 4 years ago

Zac-HD commented 4 years ago

Take for example the following schema:

{
    "person": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "parents": {
                "type": "array",
                "maxItems": 2,
                "items": {"$ref": "#/person"}
            }
        }
    },
    "$ref": "#/person"
}

So we need to generate a person, who has 0-2 parents, each of whom is also a person. More complicated situations with several mutually-recursive objects are also possible, of course. Currently such cases will fail a RecursionError as we make no attempt to handle them!

Conceptually these are all easy enough to handle with the st.deferred() strategy... which is the extent of the good news. Some complications:

  1. We want to avoid using st.deferred() when we don't actually need to, as it makes introspection (including e.g. error messages!) much less helpful, and adds some performance overhead (small but rarely needed).
  2. resolve_all_refs() can still resolve all non-recursive references in place, but will need to track which references have been traversed to avoid cycles.
  3. canonicalish() can still discard all non-functional keys, so long as it tracks which keys are reference targets. We can also re-write these to a standard location such as #/definitions/*.
  4. from_schema() can then switch into "deferred mode" if and only if there is a #/definitions part of the schema after canonicalising.

This is all basically managable, but it's also going to be a lot of work that I just don't have capacity to do for free. Pull requests and/or funding welcome, but otherwise expect this issue to stay open.

Stranger6667 commented 4 years ago

Hi! First of all, sorry for not answering your email about references support :( It was quite a long time ago - there were some incompatibilities when I tried it as far as I remember, I'll try to take a look at it again in the near future :)

I noticed, that most of the bugs reported to Schemathesis was about some weird valid schemas, that I just missed when was implementing them - a significant part of tests are example-based. However, being able to generate valid Open API schemas from their JSON Schema definitions could really help to find such edge cases.

So I tried to generate schemas with hypothesis-jsonschema and faced that RecursionError due to recursive references in the Open API JSON Schema definitions.

This is all basically managable, but it's also going to be a lot of work that I just don't have capacity to do for free. Pull requests and/or funding welcome, but otherwise expect this issue to stay open.

I'd like to try to implement it, at least partially, but I am afraid that I'll need some guidance. Can I ask you if it will be possible for you to review a PoC pull request in the next few weeks? I am also open to discuss the funding part as well :)

Cheers

Zac-HD commented 4 years ago

I noticed, that most of the bugs reported to Schemathesis was about some weird valid schemas, that I just missed when was implementing them - a significant part of tests are example-based. However, being able to generate valid Open API schemas from their JSON Schema definitions could really help to find such edge cases.

This is a very familiar idea - I've wanted to generate JSON schemas from the meta-schema to test hypothesis-jsonschema itself basically forever.

My two workarounds were to (1) just write a recursive schema-generating strategy by hand, which is a fair bit of work but very useful, and (2) scrape every schema I could find on the internet into my test suite, notably from the upstream test suite and the "schema store" website. Both of the latter cases have proved invaluable, and I would recommend doing this in addition to any schema generation tricks we can pull off.

Obviously I would still like to support recursive references for self-tests and as a general feature, but it may not be such a severe blocker for your testing :slightly_smiling_face:

I'd like to try to implement it, at least partially, but I am afraid that I'll need some guidance. Can I ask you if it will be possible for you to review a PoC pull request in the next few weeks?

Yes, I'd be very happy to review PRs or experiments leading towards them. As to guidance: I think the steps go something like

  1. make resolve_all_refs() skip recursive refs while resolving all others
  2. make canonicalish() work with schemas that contain references.
    This probably just means recursing into a definitions dict, but requires some care that re-writes don't move reference targets - I think we'll probably want to copy all referenced subschemas to a top-level definitions dict regardless of initial location and update the reference pointers. Reference re-writing may be a new function?
    In the rewrite stage we'll also want to merge in any overlapping schemas, but this is fiddly and can come later.
  3. add a st.deferred() whenever converting a ref to a strategy.
    We know that all remaining references are recursive, so this is pretty simple.

I am also open to discuss the funding part as well :)

Let's do this by email if you find that your proof-of-concept isn't working out, or I exhaust my volunteer time :)

Zac-HD commented 3 years ago

Hey @Stranger6667 - inspired by #68, I'm wondering if we could quickly adapt #61 to just skip resolving recursive schemas. That would mean we only ever generate the base case, but at least it does something valid for them! Thoughts?

sdementen commented 3 years ago

Is there some way to identify the cycle leading to the recursive error in resolve_all_refs()? If so, I may be able to pre-patch the schema to remove these cycles

Stranger6667 commented 3 years ago

@Zac-HD I think it is a viable option, indeed! I will check if it will work and post an update here. At first glance, it should be ok but should include additional handling for #65 as well.

@sdementen Yes, in #61, I implemented it with a map of paths that were already seen during resolving + additional check for the input reference. However, I am not 100% sure that it covers all cases - I need to recheck this.