pluralsight / spavro

Spavro is a (sp)eedier avro library -- Spavro is a fork of the official Apache AVRO python 2 implementation with the goal of greatly improving data read deserialization and write serialization performance.
Apache License 2.0

Issue with DataFileWriter and union of records. #3

Closed shawnsarwar closed 6 years ago

shawnsarwar commented 6 years ago

I'm trying to serialize a message with its schema, but I'm getting the following error from spavro:

Traceback (most recent call last):
  File "src/spavro/fast_binary.pyx", line 702, in spavro.fast_binary.get_writer
  File "src/spavro/fast_binary.pyx", line 496, in spavro.fast_binary.make_union_writer
  File "src/spavro/fast_binary.pyx", line 360, in spavro.fast_binary.get_check
  File "src/spavro/fast_binary.pyx", line 364, in spavro.fast_binary.make_record_check
  File "src/spavro/fast_binary.pyx", line 360, in spavro.fast_binary.get_check
  File "src/spavro/fast_binary.pyx", line 408, in spavro.fast_binary.make_union_check
  File "src/spavro/fast_binary.pyx", line 360, in spavro.fast_binary.get_check
KeyError: 'org.eha.demo.parents'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./manage.py", line 19, in <module>
    aether_producer.main_loop()
  File "/code/producer/aether_producer.py", line 277, in main_loop
    manager.send(entities)
  File "/code/producer/aether_producer.py", line 250, in send
    self.streams[topic].send(row)
  File "/code/producer/aether_producer.py", line 177, in send
    writer = DataFileWriter(bytes_writer, DatumWriter(), self.schema, codec='deflate')
  File "/usr/local/lib/python3.6/site-packages/spavro/datafile.py", line 99, in __init__
    self.datum_writer.writers_schema = writers_schema
  File "/usr/local/lib/python3.6/site-packages/spavro/io.py", line 820, in writers_schema
    self.write_datum = get_writer(parsed_writer_schema.to_json())
  File "src/spavro/fast_binary.pyx", line 706, in spavro.fast_binary.get_writer
KeyError: 'union'

I can perform the operation in both avro and avro-python3. Also, the message validates against the schema in avro, avro-python3 and spavro.

Here is the function in question:

import json
from io import BytesIO

from spavro.datafile import DataFileWriter
from spavro.io import DatumWriter, validate
from spavro.schema import parse

# Load the message body and the schema shown below.
with open("./payload.json") as f:
    obj = json.load(f)

with open("./schema.json") as f:
    schema = json.load(f)


# spavro (current)
def t2(obj, schema):
    bytes_writer = BytesIO()
    avsc = parse(json.dumps(schema, indent=2))
    # Validation passes; the error is raised when DataFileWriter is constructed.
    res = validate(avsc, obj)
    print(res)

    with DataFileWriter(bytes_writer, DatumWriter(), avsc, codec='deflate') as writer:
        writer.append(obj)
        writer.flush()
        # Grab the serialized bytes while the writer is still open.
        raw_bytes = bytes_writer.getvalue()
    return raw_bytes

The schema is:

[
        {
            "name": "org.eha.demo.children",
            "type": "record",
            "label": "children",
            "fields": [
                {
                    "name": "Person-id",
                    "type": [
                        "string",
                        {
                            "type": "array",
                            "items": "string"
                        }
                    ],
                    "jsonldPredicate": {
                        "_id": "org.eha.demo.Person",
                        "_type": "@id"
                    }
                }
            ],
            "namespace": "org.eha.demo"
        },
        {
            "name": "org.eha.demo.home",
            "type": "record",
            "label": "home",
            "fields": [
                {
                    "name": "Text",
                    "type": [
                        "null",
                        "string",
                        {
                            "type": "array",
                            "items": "string"
                        }
                    ],
                    "jsonldPredicate": "xsd:string"
                },
                {
                    "name": "Place-id",
                    "type": [
                        "string",
                        {
                            "type": "array",
                            "items": "string"
                        }
                    ],
                    "jsonldPredicate": {
                        "_id": "org.eha.demo.Place",
                        "_type": "@id"
                    }
                }
            ],
            "namespace": "org.eha.demo"
        },
        {
            "name": "org.eha.demo.parents",
            "type": "record",
            "label": "parents",
            "fields": [
                {
                    "name": "Person-id",
                    "type": [
                        "string",
                        {
                            "type": "array",
                            "items": "string"
                        }
                    ],
                    "jsonldPredicate": {
                        "_id": "org.eha.demo.Person",
                        "_type": "@id"
                    }
                }
            ],
            "namespace": "org.eha.demo"
        },
        {
            "name": "org.eha.demo.Person",
            "type": "record",
            "label": "Person",
            "fields": [
                {
                    "name": "id",
                    "type": "string",
                    "inherited_from": "org.eha.demo.BaseModel",
                    "jsonldPredicate": "@id"
                },
                {
                    "name": "rev",
                    "type": "string",
                    "inherited_from": "org.eha.demo.BaseModel"
                },
                {
                    "name": "parents",
                    "type": [
                        "null",
                        "org.eha.demo.parents"
                    ],
                    "jsonldPredicate": "org.eha.demo.parents"
                },
                {
                    "doc": "A Name",
                    "name": "name",
                    "type": [
                        "null",
                        {
                            "type": "array",
                            "items": "string"
                        },
                        "string"
                    ],
                    "jsonldPredicate": "org.eha.demo.name"
                },
                {
                    "doc": "ISO date as a string",
                    "name": "dateOfBirth",
                    "type": [
                        "null",
                        {
                            "type": "array",
                            "items": "string"
                        },
                        "string"
                    ],
                    "jsonldPredicate": "org.eha.demo.dateOfBirth"
                },
                {
                    "name": "children",
                    "type": [
                        "null",
                        "org.eha.demo.children"
                    ],
                    "jsonldPredicate": "org.eha.demo.children"
                },
                {
                    "doc": "A description of the thing.",
                    "name": "description",
                    "type": [
                        "null",
                        {
                            "type": "array",
                            "items": "string"
                        },
                        "string"
                    ],
                    "jsonldPredicate": "org.eha.demo.description"
                },
                {
                    "name": "home",
                    "type": [
                        "null",
                        "org.eha.demo.home"
                    ],
                    "jsonldPredicate": "org.eha.demo.home"
                }
            ],
            "namespace": "org.eha.demo",
            "aetherBaseSchema": true
        }
    ]

The message body is:

{
    "id": "c55fa41c-f2cb-4378-9a58-631d417fc085",
    "description": ["wlpqz", "bfg"],
    "home": {
        "Place-id": "38de9d89-f2e4-4639-884b-0f6e2246b4df",
        "Text": "srjygtwcd"
    },
    "dateOfBirth": ["unzmgdpweh", "lqfjvtigdkn"],
    "children": null,
    "name": "mgzn",
    "rev": "tqyams",
    "parents": {
        "Person-id": ["2e08b855-345c-4573-bd45-cdf1a25624f6", "503d0b26-989b-4967-871d-2c3194616013", "517bb3a0-26c3-4708-85c3-4ed4c6fb95af"]
    }
}

mikepk commented 6 years ago

Thanks for the report, I'll try to repro and get back to you.

mikepk commented 6 years ago

Thanks for the bug report! I managed to reproduce the case you submitted. It turns out two bugs were interacting here.

The first is in how namespaces were being processed inside spavro, which has its own tracking for the recursive writer functions, separate from the schema parsing pass. That tracking wasn't honoring the 'if it has a dot in the name, it's the fully qualified name' part of the spec.

The second is in how records are processed: spavro checks the datum to see whether it conforms to the schema, which is how it matches branches in Union schemas. The 'array' check code wasn't making sure the datum was iterable before trying to verify that it matched the array schema, which raised an exception when it tried to iterate over non-lists.
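To make the two bugs concrete, here is a minimal sketch in plain Python. It is not spavro's code and is not taken from the commit. The helper names (`full_name`, `naive_full_name`, `looks_like_array`, `is_string`) are hypothetical, and the `isinstance(datum, list)` guard is an assumption standing in for whatever iterability check the actual fix uses.

```python
def full_name(name, namespace=None):
    """Resolve an Avro fullname following the spec's naming rule."""
    # A name containing a dot is already fully qualified; the namespace is ignored.
    if "." in name or not namespace:
        return name
    return namespace + "." + name


def naive_full_name(name, namespace=None):
    """Hypothetical buggy variant: always prefixes the namespace."""
    return namespace + "." + name if namespace else name


def is_string(datum):
    return isinstance(datum, str)


def looks_like_array(datum, item_check):
    """Return True only if datum is a list whose items all pass item_check."""
    # Guard first: iterating None raises TypeError, and iterating a str
    # would wrongly accept it as an array of one-character strings.
    if not isinstance(datum, list):
        return False
    return all(item_check(item) for item in datum)


# One record from the schema in this issue, registered under its fullname.
record = {"name": "org.eha.demo.parents", "namespace": "org.eha.demo"}
registry = {full_name(record["name"], record["namespace"]): record}
naive_registry = {naive_full_name(record["name"], record["namespace"]): record}

print("org.eha.demo.parents" in registry)        # True
print("org.eha.demo.parents" in naive_registry)  # False: a lookup would raise KeyError
print(looks_like_array(["a", "b"], is_string))   # True
print(looks_like_array("mgzn", is_string))       # False
print(looks_like_array(None, is_string))         # False
```

With the naive variant the record ends up registered as 'org.eha.demo.org.eha.demo.parents', so a later lookup of 'org.eha.demo.parents' fails in the same way as the KeyError in the traceback.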

commit with changes: 2c99b3f7da7d26ab4a09758ec2f25ef387214d2a

Let me know if this fixes the issue for you. I've also pushed a new version of spavro, 1.1.11, to PyPI with the fix(es).

mikepk commented 6 years ago

Let me know, @shawnsarwar, if you're OK with me closing this issue.

mikepk commented 6 years ago

I haven't heard anything, so I'm closing this. I believe it is fixed.

shawnsarwar commented 6 years ago

I need to turn on notifications. I didn't see this until it had been closed. I'll test and if it's not resolved I'll open a new ticket referencing this one.

shawnsarwar commented 6 years ago

I'm getting the exact same error with spavro 1.1.15. I'm running it in pipenv, so I'm pretty sure I'm running the right version. The only apparent change is the line numbers in the traceback.

        "spavro": {
            "hashes": [
                "sha256:25e2994564df461baf739d2825e2451d5875de5583d0ed8c92070738661a6f4a"
            ],
            "version": "==1.1.15"
        }

Since the details of the issue are the same, I'll hold off on opening a new ticket unless I hear from you.

Traceback (most recent call last):
  File "src/spavro/fast_binary.pyx", line 705, in spavro.fast_binary.get_writer
  File "src/spavro/fast_binary.pyx", line 499, in spavro.fast_binary.make_union_writer
  File "src/spavro/fast_binary.pyx", line 363, in spavro.fast_binary.get_check
  File "src/spavro/fast_binary.pyx", line 367, in spavro.fast_binary.make_record_check
  File "src/spavro/fast_binary.pyx", line 363, in spavro.fast_binary.get_check
  File "src/spavro/fast_binary.pyx", line 411, in spavro.fast_binary.make_union_check
  File "src/spavro/fast_binary.pyx", line 363, in spavro.fast_binary.get_check
KeyError: 'org.eha.demo.parents'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test.py", line 68, in <module>
    test(obj, schema)
  File "./test.py", line 42, in t2
    with DataFileWriter(bytes_writer, DatumWriter(), avsc, codec='deflate') as writer:
  File "/home/sarwar/.local/share/virtualenvs/avro-check-in6pzLTy/lib/python3.5/site-packages/spavro/datafile.py", line 99, in __init__
    self.datum_writer.writers_schema = writers_schema
  File "/home/sarwar/.local/share/virtualenvs/avro-check-in6pzLTy/lib/python3.5/site-packages/spavro/io.py", line 820, in writers_schema
    self.write_datum = get_writer(parsed_writer_schema.to_json())
  File "src/spavro/fast_binary.pyx", line 709, in spavro.fast_binary.get_writer
KeyError: 'union'

mikepk commented 6 years ago

OK, I've got a new process for uploading to PyPI, and it somehow failed to pick up the C code for the extension in v1.1.15. I've uploaded a new version, v1.1.16, that should have the updated extension. Let me know if that version works and I'll close this.

mikepk commented 6 years ago

Argh, and the docs broke on PyPI again. This new PyPI is a pain in my butt.

mikepk commented 6 years ago

I'll leave this issue open until I hear from you @shawnsarwar (not sure if you turned on notifications).

shawnsarwar commented 6 years ago

Thanks! I'll take a look and report back on Monday.

shawnsarwar commented 6 years ago

Working as advertised. Thanks!

mikepk commented 6 years ago

Good, I'm glad. I'd love to hear about your experience using Spavro. How did you find it? Has it been easy to work with? Performant? I'm debating whether to publicize it more.

shawnsarwar commented 6 years ago

Our current applications for avro aren't terribly performance-sensitive yet, so the primary impetus for choosing spavro was its API compatibility with the "official" lib and across Python 2 and 3. On all those fronts it's been great. Normally I'd be hesitant to use something like spavro (non-trivial complexity, C extensions where I don't yet require performance, competing with a more official implementation), but avro and avro-python3 are so neglected that I think you've got a great case for wide adoption.