jpmorganchase / py-avro-schema

Generate Apache Avro schemas for Python types including standard library data-classes and Pydantic data models.
https://py-avro-schema.readthedocs.io/
Apache License 2.0
37 stars 6 forks source link

Use Pydantic's `model_config` to allow record name override #72

Closed cbini closed 6 months ago

cbini commented 7 months ago

I'm migrating over to using Pydantic for defining and this library to generate Avro schema. It's awesome, but one challenge I'm running into is being able to override record names to maintain backwards compatibility.

For example, if I have one data asset partitioned across multiple Avro files in my data lake, BigQuery cannot successfully piece them together unless each file has consistent record names.

The workaround I've found is to subclass my Pydantic model with the legacy naming convention, but I feel like it would be much better if this library considered Pydantic's model_config. That way, I could keep everything in one, nicely named class, like so:

class StudentModel(BaseModel):
    model_config = ConfigDict(title="student_record")

    StudentID: str | None = None
    StudentSchoolID: str | None = None
    SecondaryStudentID: str | None = None
    StudentFirstName: str | None = None
    StudentMiddleName: str | None = None
    StudentLastName: str | None = None

Pydantic's BaseModel.model_json_schema() method behaves this way, but the way pas.generate() is set up, it just uses the literal class name.

faph commented 7 months ago

Hi @cbini

That seems a useful enhancement indeed.

Would this work for everybody always? I agree that Pydantic's notation of "title" maps nicely onto JSON Schema "title".

Without knowing JSON Schema though, I might interpret Pydantic's "title" concept as a human-friendly name in addition to the system-friendly class name. I guess Pydantic allows any string as title? Say including spaces? That would not necessarily lead to a valid Avro record schema...

faph commented 6 months ago

Hi @cbini, do you still require this?

cbini commented 6 months ago

This would be a nice-to-have in a future version. The workaround I've implemented now is good enough though--haven't had much bandwidth to revisit it.

faph commented 6 months ago

Understood. Let’s keep this open.

I’m thinking this should be controlled through an additional option flag similar to what we use for Pydantic field aliases.

faph commented 6 months ago

This PR ^ add the following option USE_CLASS_ALIAS, similar to USE_FIELD_ALIAS. On top of that, it implements Avro name validation across the board.