gtalarico / pyairtable

Python API Client for Airtable
https://pyairtable.readthedocs.io
MIT License

For discussion: adding Pydantic for metadata models #258

Closed by mesozoic 1 year ago

mesozoic commented 1 year ago

This is a proposal for how to represent complex nested data structures from the Airtable API. This proposal would benefit from, but does not strictly require, removing ApiAbstract (see #257).

tl;dr

I'd like to add pydantic as a dependency and use that to serialize and deserialize Airtable models from their metadata APIs.

Rationale

Most Python developers these days use supportive development tools that can provide type hints, autocomplete, and more. Developers who need to interact with the nested data structures returned by the Airtable API would benefit from being able to navigate those within their code editors' tooling.

Some projects using this library might also want to enforce strict typing, and today there's no common way for them to ensure that the properties they reference on pyairtable's return types actually exist.

Design

The module layout is very open to discussion, but it could be something like this:
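The layout sketch seems to have been lost from this comment. One hypothetical layout, inferred from the `pyairtable.metadata.base_schema.TableSchema` path used in the example below (module and file names are illustrative only):

```
pyairtable/
    metadata/
        __init__.py       # existing helper functions (to be deprecated)
        base_schema.py    # TableSchema, FieldSchema, ... (Pydantic models)
        webhooks.py       # webhook payload models
```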

Example

From a user's perspective, this will be relatively transparent. They will call methods that we expose on classes defined in pyairtable.api, and retrieve normal Python data structures that they can interact with:

>>> table = Table("api_key", "appId", "tblId")
>>> schema = table.get_schema()
>>> type(schema)
<class 'pyairtable.metadata.base_schema.TableSchema'>
>>> schema.id
'tblId'
>>> schema.fields[0].id
'fld1VnoyuotSTyxW1'
>>> schema.fields[0].type
'singleLineText'

For now I'm not envisioning these data structures knowing how to call the API or save modifications to themselves. We can probably start with bespoke methods for each type of modification, for example:

>>> table = Table("api_key", "appId", "tblId")
>>> table.get_schema().description
'Apartments to track.'
>>> table.update_schema(description="Apartments we're tracking.")
>>> table.get_schema().description
"Apartments we're tracking."

I haven't taken the time to think through the exact names/signatures of every method we'd add, but I think we can probably consider those as we go.

Deprecation

We would mark the existing functions in pyairtable.metadata as deprecated, for removal in 3.0.0. Alternatively we could mark them as deprecated in a point release (1.5.1) and then remove them in 2.0.0. My instinct is to err toward compatibility.
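A minimal sketch of what the deprecation shim could look like, assuming a hypothetical `get_table_schema` helper (names illustrative): the old functions in pyairtable.metadata stay callable but emit a DeprecationWarning and delegate to the new schema methods until removal.

```python
# Hypothetical sketch: an existing pyairtable.metadata helper kept as a thin
# deprecated wrapper that delegates to the new Pydantic-returning method.
import warnings


def get_table_schema(table):
    """Deprecated: use Table.get_schema() instead."""
    warnings.warn(
        "pyairtable.metadata.get_table_schema is deprecated; "
        "use Table.get_schema() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return table.get_schema()
```

Keeping the old entry points as warning-emitting wrappers is what makes the "deprecate in a point release, remove later" path painless for callers.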

Future

A couple other ideas that I haven't explored much:

  1. It is possible we could make ORM-like features with these objects, such as manipulating their state and calling .save() directly. For now I've not contemplated this too deeply, as I am mostly focused on being able to read state from the API.

  2. We could have a dict-like (backwards compatible) Record dataclass that defines id, created_time, and fields. I consider that out of scope for this proposal because it's data and not metadata. I think pyairtable.orm is a better pattern to follow.
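For what idea (2) might look like, here is a rough sketch (all names illustrative, not a committed design): a dataclass carrying id, created_time, and fields that still answers dict-style lookups, so existing code like record["fields"]["Name"] keeps working.

```python
# Sketch of a dict-compatible Record dataclass; names are illustrative only.
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Record:
    id: str
    created_time: str
    fields: Dict[str, Any] = field(default_factory=dict)

    def __getitem__(self, key: str) -> Any:
        # Accept the API's camelCase key as well as the attribute name,
        # preserving backwards compatibility with plain-dict access.
        if key == "createdTime":
            return self.created_time
        return getattr(self, key)
```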

Alternatives considered

  1. Just return Dict[str, Any]. Sure, it works, but where's the fun in that? :grin:
  2. Just return TypedDicts. The number of TypedDict definitions to create and maintain would make this alternative no less complex or burdensome for the package's maintainers, but it would represent significantly less functionality for developers who use this library.
  3. Use dataclass-factory. We've used this library in the past with some success. However, it has not seen updates for several months, so I thought it might be prudent for us to rely on pydantic instead (which is being actively developed).
  4. Use dataclasses-json. Same rationale as above.
  5. Use dataclasses-jsonschema. Same rationale as above.
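For contrast, alternative (2) might look roughly like this (field names are illustrative). The editor sees the shape, but at runtime a TypedDict is a plain dict: nothing validates the payload, and typos in keys pass silently.

```python
# TypedDicts give static shape hints but no runtime parsing or validation.
from typing import List, TypedDict


class FieldSchemaDict(TypedDict):
    id: str
    name: str
    type: str


class TableSchemaDict(TypedDict):
    id: str
    name: str
    fields: List[FieldSchemaDict]


schema: TableSchemaDict = {
    "id": "tblId",
    "name": "Apartments",
    "fields": [{"id": "fld1", "name": "Address", "type": "singleLineText"}],
}
```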

Thoughts?

xl0 commented 1 year ago

@mesozoic , could you please clarify if there will also be a provision for generating the Pydantic schema for the table metadata?

mesozoic commented 1 year ago

> @mesozoic , could you please clarify if there will also be a provision for generating the Pydantic schema for the table metadata?

My first thought here is just to use Pydantic to represent all the complex metadata we get back from the API (schemas, webhooks, etc). Building a Pydantic model to reflect a table's data is an interesting idea, but I'm not sure how useful it will be (since the ORM module does not use Pydantic under the hood).

If I'm not quite getting your meaning, perhaps you could clarify your use case?

xl0 commented 1 year ago

Here's how I'm using pyairtable with Pydantic at the moment (I'm pretty new to Pydantic):

Base class for any Airtable data:

Note that the record_id and record_created_time are excluded from serialization. This way we can get a record, modify it, and call table.update() on the same object.

from pydantic import BaseModel, Field

class AirTableRow(BaseModel):
    record_id: str = Field(None, alias="id", exclude=True)
    record_created_time: str = Field(None, alias="createdTime", exclude=True)

    class Config:
        extra = "forbid" # Catch typos and field name changes
        allow_population_by_field_name = True

    @classmethod
    def from_dict(cls, d):
        return cls(record_id=d["id"], record_created_time=d["createdTime"], **d["fields"])

Now for each table, we need to define the schema.

Note that the calculated field, again, is excluded from serialization, as it can't be part of an update.

class Data(AirTableRow):
    field1: str = Field(None, alias="Field 1")
    field2: int = Field(None, alias="Field 2")

    calculated_field: int = Field(None, alias="Calculated Field", exclude=True)

Now using the class with pyairtable:

table = pyairtable.Table(AT_API_KEY, "appDsQdcFsh1bJlGE", "Test")

data = [ Data.from_dict(d) for d in table.all() ]
data
[Data(record_id='rec4yN9Jr6cjH6zbW', record_created_time='2023-07-09T07:39:40.000Z', field1='Hello there.', field2=345, calculated_field=357),
 Data(record_id='recJSJdoiBdNOqeTP', record_created_time='2023-07-09T07:39:40.000Z', field1='General Kenobi!', field2=123, calculated_field=138)]

Here is what happens if we serialize the data:

list(map(lambda x: x.dict(by_alias=True, exclude_unset=True), data))
[{'Field 1': 'Hello there.', 'Field 2': 345},
 {'Field 1': 'General Kenobi!', 'Field 2': 123}]

So to update a record, we can do:

data[1].field2 = 4321
table.update( data[1].record_id, data[1].dict(by_alias=True, exclude_unset=True) )
{'id': 'recJSJdoiBdNOqeTP',
 'createdTime': '2023-07-09T07:39:40.000Z',
 'fields': {'Field 2': 4321,
  'Field 1': 'General Kenobi!',
  'Calculated Field': 4336}}

Or with a new Data object (split across two lines for readability):

update = Data(field1="Lalalala")
table.update(data[0].record_id, update.dict(by_alias=True, exclude_unset=True))
{'id': 'rec4yN9Jr6cjH6zbW',
 'createdTime': '2023-07-09T07:39:40.000Z',
 'fields': {'Field 1': 'Lalalala', 'Field 2': 345, 'Calculated Field': 353}}

This provides at least some type safety and working IntelliSense. There is definitely room for improvement; for example, we could have a way to make

data = Data.all()

return a List[Data].
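One possible shape for that Data.all() convenience (a sketch, not pyairtable API): a mixin whose all() classmethod fetches every record from a bound table object and parses each with from_dict(). The `table` attribute and `from_dict` are assumed to be supplied per subclass, as in the snippets above.

```python
# Sketch of a classmethod that returns parsed model instances for every
# record in a table. `table` and `from_dict` are assumed per subclass.
from typing import Any, Dict, List


class TableBoundMixin:
    table: Any = None  # e.g. Data.table = pyairtable.Table(key, base_id, name)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]):
        raise NotImplementedError  # supplied by AirTableRow in the snippets above

    @classmethod
    def all(cls) -> List["TableBoundMixin"]:
        # Fetch every raw record and parse each one into a model instance.
        return [cls.from_dict(record) for record in cls.table.all()]
```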

What I'm suggesting is: let's have an official tool that takes the table schema from Airtable and generates the Pydantic boilerplate. Would that make sense? Is there enough information provided by the API to generate it automatically?

mesozoic commented 1 year ago

@xl0 What you're describing seems like an interesting approach to consider when we get around to autogenerating ORM classes from table schemas (probably 3.0; see roadmap in #249). I think we'll need to weigh whatever advantages or new features it provides against whatever ways it might break backwards-compatibility with the current ORM module.

This thread was intended solely to suggest using Pydantic (vs. plain old dicts) for metadata like schemas, webhooks, etc. Seems like that's probably acceptable, since this is not the only thread where I've heard general enthusiasm for using Pydantic in more places :)