Non-deterministic schema generation for `Optional[Union[...]]` types

lgo commented 1 month ago

Describe the bug

I've been hitting some odd non-determinism for some complex schemas. I haven't quite narrowed it down or reached a reproduction as it also seems to be dependent on some import load, e.g. some code in one module loading schemas sees one order (str, null, int, float, ...) and code in another module run separately sees (str, int, float, ..., null).

This is resulting in some non-determinism across code that writes schemas, and code that checks for stale schemas.

Here's the specific schema field I'm dealing with in case the surrounding bits are also relevant, although I believe the problem is specifically with just the Optional[Union[...]] with several sub-values in particular.

from dataclasses_avroschema import AvroModel

class Foo(AvroModel):
  value: Optional[
        Dict[str, Optional[Union[str, int, float, bool, Dict[str, Optional[str]]]]]
    ]

I'll give a go at trying to reproduce this more reliabily later, but at the moment I've found swapping this to Union[None, ...] to force the ordering to be consistent.

To Reproduce Steps to reproduce the behavior

Expected behavior A clear and concise description of what you expected to happen.

lgo commented 1 month ago

Seems I'm being bested by this! Even after swapping Optional[Union[...]] to Union[None, ...], when I run this via one script (locally) I'm getting the null showing up first, but when it's run in another environment (CI) in a test it's generating it with the null in the end of a type list.

In a few cases, I've found setting Union[..., None] to work fine but in others the None is placed in the second type position.

marcosschroh commented 1 month ago

Which Python version are you using locally and in the CI? Are you installing estará packages during the CI that you do not have locally?

lgo commented 1 month ago

This is with Python 3.11.8 and notably dataclasses-avroschema 0.60.0 (we cannot bump it further because the required python-dateutil conflicts with requirements among other libraries we use). The Python version and installed libraries should be the same as it's all through some standard tooling at my workplace, but I'll report back with differences if that's not the case.

lgo commented 1 month ago

Okay, I've managed to now reproduce the issue within the same environment, locally. This is where:

Our script writing Avro JSON, produces a result that has the null placed last in the values type descriptor for a map
Our test that regenerates Avro JSON and compares against the writen files (for staleness checks), produces a result that places the null in the 2nd position

lgo commented 1 month ago

Comically, I thought I could be clever and get away with forcing the null to the start or end in some post-processing, but it turns out that violates the Avro spec - https://avro.apache.org/docs/1.11.1/specification/#unions

(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing “null”, the “null” is usually listed first, since the default value of such unions is typically null.)

But even more annoyingly, avro seems to not validate the default value type correctness at all https://issues.apache.org/jira/browse/AVRO-3272

and also fastavro, used by this library, does validate the default value type is a valid type but it does not check that it is the first value, so it also let this slide. (https://github.com/fastavro/fastavro/issues/785)

lgo commented 1 month ago

Okay, I think I see where this is going wrong. At this point of the UnionField handling, we call get_args for the types. https://github.com/marcosschroh/dataclasses-avroschema/blob/52ac9b46e047a63aa43806f3146572b91b9d36fa/dataclasses_avroschema/fields/fields.py#L313

In the Python docs for typing.get_args it has this releveant tidbit:

If X is a union or Literal contained in another generic type, the order of (Y, Z, ...) may be different from the order of the original arguments [Y, Z, ...] due to type caching. Return () for unsupported objects.

This is happening specifically for the Union within Dict type, which matches what this document outlines.

I suspect if we simply apply some sort of deterministic sorting to the types coming into that tuple, we would be able to ensure this doesn't get jumbled between runs. The subsequent code to ensure the default is at the front would still function fine regardless of the order coming in. https://github.com/marcosschroh/dataclasses-avroschema/blob/52ac9b46e047a63aa43806f3146572b91b9d36fa/dataclasses_avroschema/fields/fields.py#L318-L331

lgo commented 1 month ago

I'll work on a PR for ^

While I'm here, any chance you'd also be able to loosen some dependency version requirements? It'd be great to shift something like ^2.9.0.post0 back which I suspect is higher than required and is a pain to work around (we've have to vend this internally)

lgo commented 1 month ago

Python doesn't seem to have good facilities or sorting types, so this would require a custom comparison fucntion for Python types. I'm not too keen on that given the complexity but it would accomplish the goals. What do you think @marcosschroh?

I've prepped some tests here which more-or-less exhibits the issue where different Union results in different schemas, albeit doesn't quite match the same where the same Union can have different ordering going in. My expectation would be to get both test cases to start passing. https://github.com/marcosschroh/dataclasses-avroschema/compare/master...lgo:dataclasses-avroschema:joey-make-union-schema-deterministic?expand=1

marcosschroh commented 1 month ago

@lgo I do not think get_args is the problem. Actually it is deterministic:

from typing import Optional, Dict, Union, get_args

types = Optional[
        Dict[str, Optional[Union[str, int, float, bool, Dict[str, Optional[str]]]]]
    ]

result = get_args(types)
for _ in range(1, 1000):
    if result != get_args(types):
        raise ValueError

I always get the same result (typing.Dict[str, typing.Union[str, int, float, bool, typing.Dict[str, typing.Optional[str]], NoneType]], <class 'NoneType'>).

For the schema is the same, it is deterministic

class Foo(AvroModel):
  value: Optional[
        Dict[str, Optional[Union[str, int, float, bool, Dict[str, Optional[str]]]]]
    ]

result = Foo.avro_schema()

for _ in range(1, 1000):
    if result != Foo.avro_schema():
        raise ValueError

The result is always the same:

{"type": "record", "name": "Foo", "fields": [{"name": "value", "type": [{"type": "map", "values": ["string", "long", "double", "boolean", {"type": "map", "values": ["string", "null"], "name": "value"}, "null"], "name": "value"}, "null"]}]}

I think your problem is in the way that you are comparing the avro-json. How are you generating the values? with fake?

lgo commented 1 month ago

To clarify, the issue I'm facing is execution of the same Python code (/schemas) via different entrypoints, where unrelated code that is loaded affects the results. Sorry for the confusion! I got this minimal repro working now that I've also understood this more (:

import os.path
import os
from typing import List, Union
from dataclasses_avroschema import AvroModel
import json
import sys

if len(sys.argv) != 2:
  print("Usage: repro.py <temp schema filename>")
  sys.exit(1)

schema_file = sys.argv[1]

# Uncomment this after the first run.
#
# some_other_val: List[Union[str, None, int]] = {}

class Foo(AvroModel):
  value: List[Union[str, int, None]]

result = Foo.avro_schema()

# On first run, write the schema.
if not os.path.isfile(schema_file):
  with open(schema_file, 'w') as f:
    f.write(json.dumps(result))
  sys.exit(0)

# Otherwise, compare it.
with open(schema_file, 'r') as f:
  prev_result = json.loads(f.read())
  if prev_result != result:
    print("Previous: ", prev_result)
    print("Current: ", result)
    sys.exit(1)

$ python /Users/joeyp/Downloads/repro.py /tmp/schema_file.json

# Uncomment the indicated `some_other_val`

$ python /Users/joeyp/Downloads/repro.py /tmp/schema_file.json
Previous:  {"type": "record", "name": "Foo", "fields": [{"name": "value", "type": {"type": "array", "items": ["string", "long", "null"], "name": "value"}}]}
Current:  {"type": "record", "name": "Foo", "fields": [{"name": "value", "type": {"type": "array", "items": ["string", "null", "long"], "name": "value"}}]}

And we can see the first run has "string", "long", "null" whereas the second has "string", "null", "long"

Relating this back to the mention on the get_args page, Python happens to have some type caching of Union within generic types (e.g. Dict, List, so on) that means that results can change simply based on what/when code is loaded that has an equivalent type (Union[str, None] vs Union[None, str]). In this case, on the second run, the fact we have some_other_val load first with Union[str, None, int] influences the actual get_args result on Foo#value regardless of what it's Union order is.

In practice, this has happened as I have some code like generate_schemas.py that writes schemas, and some test_schemas_updated.py which runs in CI to read the existing files and compare them to current schemas dumped from objects. With the different entrypoints and code loading (/ code load order), we get these different JSON results for the schemas.

marcosschroh commented 3 weeks ago

Hi @lgo

Thanks for the clarification. I think we can not do anything about it as the python cache works in this way. I found the python issue related to our behavior, it seems that they are working on it. This is also affecting pydantic.

I seems that you will have to find a work around and load your in a different way and wait for the fix in python.

Regarding python-dateutil dependency, I can set python-dateutil = "^2". Does it work for you?

marcosschroh / dataclasses-avroschema

Non-deterministic schema generation for `Optional[Union[...]]` types #691