lucsorel / py2puml

Generate PlantUML class diagrams to document your Python application.
https://pypi.org/project/py2puml/
MIT License
212 stars 34 forks

Add support for SQLAlchemy models #56

Open Baukebrenninkmeijer opened 1 year ago

Baukebrenninkmeijer commented 1 year ago

Issue to add support for SQLAlchemy models. This will include considerations about one-to-one, one-to-many and many-to-many relationships. Additionally, it should also consider inheritance.

lucsorel commented 1 year ago

would you have a snippet of Python code describing a sample domain model involving SQLAlchemy, please? Something like this Pydantic example: https://docs.pydantic.dev/latest/usage/models/#nested-models.

That would help contributing to this issue.

Second question: would you agree on reviewing the corresponding pull request when it is done?

Baukebrenninkmeijer commented 1 year ago

@lucsorel Yes, I'll create some tomorrow. Naturally, I will help with the PR and review it when done. Are you also willing to help out? I would appreciate it immensely, but please only do so if you have enough free time.

lucsorel commented 1 year ago

Thank you for your help, I appreciate it :pray:

For now, I am working with other people on #43. When it is done, I will be able to work on this one. I need someone to bootstrap me on SQLAlchemy (the sample domain model); then I'll be able to see what to do.

I maintain py2puml in my free time, which varies a lot from week to week.

Moreover, there have been some older requests that I should handle too. So no ETA for any feature, I am sorry :shrug:

Baukebrenninkmeijer commented 1 year ago

@lucsorel Apologies it took a bit longer. You know how these things go.

Below, I've created a little snippet with three classes. It combines a foreign-key relationship in a one-to-many fashion with inheritance from an abstract base class.

from sqlalchemy import Boolean, Column, ForeignKey
from sqlalchemy.ext.declarative import AbstractConcreteBase
from sqlalchemy.orm import declarative_base, relationship
from sqlalchemy.types import Date, DateTime, Float, Integer, String

example_base = declarative_base()

class User(AbstractConcreteBase, example_base):
    user_id = Column(Integer, autoincrement=True, primary_key=True)
    username = Column(String(100))
    salted_pw = Column(String(100))
    created_at = Column(DateTime)
    last_login = Column(DateTime)
    age = Column(Integer)
    height = Column(Float)

class Login(example_base):
    __tablename__ = "logins"
    _id = Column(Integer, primary_key=True, autoincrement=True)
    login_timestamp = Column(DateTime)
    user_id = Column(Integer, ForeignKey('special_users.user_id'))

class specialUser(User):
    __tablename__ = 'special_users'
    __mapper_args__ = {"polymorphic_identity": "special_users", "concrete": True}

    special_id = Column(Integer, primary_key=True, autoincrement=True)
    admin_access = Column(Boolean)
    clearance_number = Column(Integer)
    logins = relationship('Login', backref='user')

lucsorel commented 1 year ago

:pray: thank you for the snippet.

If you are familiar with PlantUML code, could you draw a class diagram in the online editor and share the link (make sure that the URL to the diagram has been refreshed), please?

Honestly, I'm apprehensive about writing the parser that will handle such code :/

  1. there are fields dedicated to storing data: handling them should be rather simple, although it depends on the datatype we want to see associated with the field in the class diagram: for User.username, should it be Column, Column[String], String or str?
  2. the datatypes are not the Python built-in ones (int, bool, str, etc.) and do not even belong to the same module (sqlalchemy.Boolean, sqlalchemy.types.Integer): which ones do we want to see in the class diagram, the SQLAlchemy datatypes or the underlying native ones? (I would go for the SQLAlchemy datatypes, for simplicity's sake)
  3. there are standardized fields dedicated to the database structure (__tablename__, __mapper_args__, maybe several others): should they be included in the class diagram or omitted?
  4. there are fields dedicated to relationships (which have no standardized names, like specialUser.logins, which makes them hard to identify programmatically): should they be included in the class diagram or omitted?
  5. another issue is to identify that the class is a SQLAlchemy domain model: the sqlalchemy.orm.declarative_base function is imported (which can be done in several different ways), then called and its result assigned to a variable, then this variable is used as a parent class of the domain model
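Caveat 5 could at least be approximated with static AST parsing. Here is a rough sketch (the helper name and heuristic are made up, not py2puml code; it only catches the simple `base = declarative_base()` pattern):

```python
import ast

SAMPLE = '''
from sqlalchemy.orm import declarative_base

example_base = declarative_base()

class Login(example_base):
    pass

class NotAModel:
    pass
'''

def find_declarative_models(source: str) -> list[str]:
    """Return names of classes inheriting from a declarative_base() result."""
    tree = ast.parse(source)
    # 1) collect variables assigned from a call to declarative_base()
    base_vars = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            func = node.value.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, 'attr', None)
            if name == 'declarative_base':
                base_vars.update(t.id for t in node.targets if isinstance(t, ast.Name))
    # 2) keep classes whose bases reference one of those variables
    return [
        node.name for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
        and any(isinstance(b, ast.Name) and b.id in base_vars for b in node.bases)
    ]

print(find_declarative_models(SAMPLE))  # ['Login']
```

This would miss aliased imports and bases created in another module, but it shows that the detection does not strictly require running the code.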

This is not a simple feature you are asking for :worried:

FynnFreyer commented 1 year ago

@lucsorel

I've stumbled upon this project today (looking great, btw), and have the same problem as @Baukebrenninkmeijer. Here are my thoughts, hope they help.

SQLAlchemy 1.4 vs 2.0

AFAIK, this is the old (1.4) style of declaring models. This is still supported in version 2.0, but examples in the documentation seem to prefer the new style declaration.

I'd think 1.4 is still seen more often in the wild, but it might make sense to skip the 1.4 style and immediately support the new declarative syntax. The new declarative API works via hinting native types and is therefore much more amenable to inspection, which should make implementing this a lot easier.

That should at least take care of caveats 1 and 2. And regarding 3, my personal opinion is that it would be nice, but it's much better to have a way to automatically generate a diagram without that information than to have no way at all.
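To illustrate why the 2.0 annotations are easier to inspect, here is a sketch that unwraps Mapped[...] hints with plain typing machinery (a stand-in Mapped class is used so the snippet runs without SQLAlchemy installed; the helper name is made up):

```python
from typing import Generic, Optional, TypeVar, Union, get_args, get_origin, get_type_hints

T = TypeVar('T')

class Mapped(Generic[T]):
    """Stand-in for sqlalchemy.orm.Mapped, so this sketch needs no SQLAlchemy."""

class User:
    user_id: Mapped[int]
    age: Mapped[Optional[int]]

def unwrap(annotation) -> tuple[type, bool]:
    """Return (inner_type, is_optional) for a Mapped[...] annotation."""
    (inner,) = get_args(annotation)          # Mapped[X] -> X
    if get_origin(inner) is Union:           # Optional[X] is Union[X, None]
        non_none = [a for a in get_args(inner) if a is not type(None)]
        return non_none[0], True
    return inner, False

hints = get_type_hints(User)
print({name: unwrap(ann) for name, ann in hints.items()})
# {'user_id': (<class 'int'>, False), 'age': (<class 'int'>, True)}
```

No heuristics about Column datatypes are needed; the native Python type and its optionality fall straight out of the annotation.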

For comparison, I've ported the sample to the newer 2.0 syntax.

from datetime import datetime
from typing import Optional

from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
from sqlalchemy.types import String

class ExampleBase(DeclarativeBase):
    pass

class User(ExampleBase):
    __abstract__ = True

    user_id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[Optional[str]] = mapped_column(String(100))
    salted_pw: Mapped[Optional[str]] = mapped_column(String(100))
    created_at: Mapped[Optional[datetime]] = mapped_column()
    last_login: Mapped[Optional[datetime]] = mapped_column()
    age: Mapped[Optional[int]] = mapped_column()
    height: Mapped[Optional[float]] = mapped_column()

class Login(ExampleBase):
    __tablename__ = "logins"

    _id: Mapped[int] = mapped_column(primary_key=True)
    login_timestamp: Mapped[Optional[datetime]] = mapped_column()
    user_id: Mapped[Optional[int]] = mapped_column(ForeignKey("special_users.user_id"))

class SpecialUser(User):
    __tablename__ = "special_users"
    __mapper_args__ = {"concrete": True}

    special_id: Mapped[int] = mapped_column(primary_key=True)
    admin_access: Mapped[Optional[bool]] = mapped_column()
    clearance_number: Mapped[Optional[int]] = mapped_column()
    logins = relationship("Login", backref="user")

if __name__ == "__main__":
    # check generated SQL with this
    engine = create_engine("sqlite://", echo=True)
    ExampleBase.metadata.create_all(engine)

Compared to the old-style example, this produces identical SQL as far as I can see. As a side note, the concrete table inheritance used in this example is usually discouraged, and more of an interesting edge case, I'd argue.

Class diagrams

Caveat emptor:

I don't know how py2puml handles imports, as I have never used it before, so I'm using a very naive approach for this. Adjust accordingly.

Also, I'm generally somewhat mediocre when it comes to UML, so please excuse me if this is not ideal. For example, if there's a need to fully model the imports, I have absolutely no idea how to deal with metaclasses in UML.

The very naive, but probably good enough approach
And the PUML code:

@startuml
package sqlalchemy.orm {
    class DeclarativeBase
}
class ExampleBase
class User {
    user_id: int
    username: Optional[str]
    salted_pw: Optional[str]
    created_at: Optional[datetime]
    last_login: Optional[datetime]
    age: Optional[int]
    height: Optional[float]
}
class Login {
    _id: int
    login_timestamp: Optional[datetime]
    user_id: Optional[int]
}
class SpecialUser {
    special_id: int
    admin_access: Optional[bool]
    clearance_number: Optional[int]
    logins: Any
}
DeclarativeBase <|-- ExampleBase
ExampleBase <|-- User
ExampleBase <|-- Login
User <|-- SpecialUser
@enduml

This doesn't take care of properly typing relationships at all, and doesn't look for the __abstract__ member to figure out that User is abstract.

The very nice, but probably quite complicated approach
And the PUML code:

@startuml
package sqlalchemy.orm {
    class DeclarativeBase
}
class ExampleBase
abstract class User {
    user_id: int
    username: Optional[str]
    salted_pw: Optional[str]
    created_at: Optional[datetime]
    last_login: Optional[datetime]
    age: Optional[int]
    height: Optional[float]
}
class Login {
    _id: int
    login_timestamp: Optional[datetime]
    user_id: Optional[int]
}
class SpecialUser {
    special_id: int
    admin_access: Optional[bool]
    clearance_number: Optional[int]
    logins: Login
}
DeclarativeBase <|-- ExampleBase
ExampleBase <|-- User
ExampleBase <|-- Login
User <|-- SpecialUser
SpecialUser o-- Login
@enduml

This correctly identifies the relationship, and that User is abstract. Getting the abstract classes right should be easy, since one only has to look for an __abstract__ member that is set to True, and possibly __mapper_args__ in the inheriting classes.
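Getting the abstract flag right could look roughly like this (plain-Python sketch, no SQLAlchemy; the helper name is made up):

```python
# Model the inheritance situation from the sample above with plain classes.
class User:
    __abstract__ = True

class SpecialUser(User):
    __tablename__ = 'special_users'
    __mapper_args__ = {'concrete': True}

def is_abstract_model(cls) -> bool:
    # Only honour __abstract__ set directly on the class (via __dict__),
    # not inherited: otherwise SpecialUser would wrongly be reported abstract.
    return bool(cls.__dict__.get('__abstract__', False))

print(is_abstract_model(User), is_abstract_model(SpecialUser))  # True False
```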

I guess implementing the relationship typing would be doable. SpecialUser.logins is a relationship, so SpecialUser.logins.property has a type of Relationship.

Getting the association right would be quite hard though. You'd have to remember the __tablename__ for every class, and then take a look at e.g. SpecialUser.logins.property.local_remote_pairs, and try to go from there.

Follow imports option

Another, potentially easier, solution for this issue would be to add an option to not follow imports, like mypy has, but I don't know how feasible that is given py2puml's design.

Contributing some time

I'd be willing to spend some time to help work on this, but honestly, I don't think I'd be much help implementing anything on the py2puml side of things. So if there's some groundwork I could do, or another way I could assist to make this easier for you, I'd be happy to help.

Baukebrenninkmeijer commented 1 year ago

@FynnFreyer Thanks! Yes, this is exactly the elaboration needed, and is extremely useful!

In terms of skipping all pre-2.0 versions: I also considered that, but given that most codebases will not have migrated and will use an older version, my preference would be to see if we can support both. With the new 2.0 mapper system things get easier using type hinting, but extracting a type from a SQLAlchemy object was fairly easy before as well, since it had a mapping to Python types (i.e. model.__table__.columns[0].type.python_type).

The diagram I would have created would be the same as the one you created above, @FynnFreyer. BTW, I copied the dunder tags from what we are using at work; I'm not entirely sure whether they are required, such as the mapper args.

FynnFreyer commented 1 year ago

@Baukebrenninkmeijer

I fear you're right. Version 2 is quite new, and SQLAlchemy quite old, so only supporting 2.0 might not cut it in a lot of cases. Using model.__table__.columns[i].type.python_type seems like a pretty good solution here, and it works equally well with both styles, I just checked. I hadn't seen this attribute before, but given that it exists, I don't think only supporting version 2.0 is a big improvement over just using it.

I guess the biggest conceptual issue left would be point 5: programmatically identifying that this is a SQLAlchemy domain model. A quick and dirty way would be to check whether model.__class__.__name__ evaluates to the name of the metaclass that SQLAlchemy models use ('DeclarativeAttributeIntercept').
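The metaclass check could be sketched like this (SQLAlchemy is not imported here; a stand-in metaclass plays the role of DeclarativeAttributeIntercept, and the helper name is made up):

```python
# Stand-in for SQLAlchemy's declarative metaclass, for illustration only.
class DeclarativeAttributeIntercept(type):
    pass

class Model(metaclass=DeclarativeAttributeIntercept):
    pass

class Plain:
    pass

def looks_like_sqlalchemy_model(cls) -> bool:
    # type(cls) is the metaclass of the class; comparing by name
    # avoids importing SQLAlchemy at all.
    return type(cls).__name__ == 'DeclarativeAttributeIntercept'

print(looks_like_sqlalchemy_model(Model), looks_like_sqlalchemy_model(Plain))  # True False
```

Matching on a name string is brittle (it is an implementation detail of SQLAlchemy), hence "quick and dirty".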

FynnFreyer commented 1 year ago

One more thing I just noticed: you'd have to check model.__table__.columns[i].nullable as well, to identify whether a value is optional or not.

lucsorel commented 12 months ago

Thank you @Baukebrenninkmeijer and @FynnFreyer for the thorough description of your need and hints to a solution :pray:

However, most of the hints you gave rely on code inspection (actually loading the module, meaning that the Python interpreter actually runs it; identifying the classes; checking class attributes like model.__table__.columns[i].type.python_type, model.__table__.columns[i].nullable, model.__class__.__name__) and on heuristics to filter in only the relevant attributes (the ones which hold data) and to filter out the ones related to relationships (which would be replaced by inheritance or association UML arrows).

Code inspection is a great and powerful feature of Python, and it is what I used when I started this tool because it was so convenient. But it is also painful and risky (see https://github.com/lucsorel/py2puml/issues/59#issuecomment-1720155485: the 3rd bullet mentions that the Python interpreter actually runs the code, and unexpected side effects can happen, like writing to a database or removing files, etc.; I don't want to be responsible for that). I had the idea of removing code inspection and relying only on abstract syntax tree (AST) parsing to do the job, but it is a contributor who actually created issue #59 to get rid of code inspection.

Some downsides of AST parsing are that it is harder to implement heuristics and py2puml features, and that py2puml will miss class attributes added dynamically (by decorators, for example). But as a maintainer, I would feel safer knowing that py2puml does not execute the code that it inspects.

It seems to me that extracting the domain model from an SQLAlchemy model should be feasible with AST parsing, but with different heuristics relying on type annotations (SQLAlchemy 2+) or on the expression assigned to the class fields (SQLAlchemy 1.4); what do you think?
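A minimal sketch of those two heuristics side by side (the helper name and sample are illustrative only, not py2puml code):

```python
import ast

SOURCE = '''
class Login:
    _id = Column(Integer, primary_key=True)                        # 1.4 style
    login_timestamp: Mapped[Optional[datetime]] = mapped_column()  # 2.0 style
'''

def field_types(class_source: str) -> dict[str, str]:
    """Read the annotation when there is one (SQLAlchemy 2+ style),
    else read the first argument of the Column(...) call (1.4 style)."""
    fields = {}
    (cls,) = ast.parse(class_source).body
    for stmt in cls.body:
        if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
            fields[stmt.target.id] = ast.unparse(stmt.annotation)
        elif isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
            func = stmt.value.func
            if getattr(func, 'id', None) == 'Column' and stmt.value.args:
                fields[stmt.targets[0].id] = ast.unparse(stmt.value.args[0])
    return fields

print(field_types(SOURCE))
# {'_id': 'Integer', 'login_timestamp': 'Mapped[Optional[datetime]]'}
```

Nothing is executed: both styles reduce to walking class-body statements and unparsing either the annotation or the Column argument.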

Baukebrenninkmeijer commented 12 months ago

@lucsorel These are 100% fair considerations, and I think it should be doable with just the AST as well. Given that in 1.4 the type is mentioned in the Column call and in 2.0 as a quasi-normal Python type hint, do you think it will be significantly more complex than parsing other types of code? Compared with Pydantic, for example: they also express the type as a type hint, similarly to SQLAlchemy 2.0, so my expectation is that similar approaches can be leveraged.

In terms of sequencing and effort, I'd start with the 2.0 implementation, given my additional expectation that it will be easier since it uses type hints. We will in the (hopefully) near future also migrate to 2.0, so maybe that should just be the way to go.

Naturally, support for 1.4 would also be nice for future-proofing, but given the circumstances I think this is the better way forward.

lucsorel commented 12 months ago

do you think it will be significantly more complex than parsing other types of code? Compared with pydantic for example?

It should not be very different; but for now, such classes (where an abstract base class or a decorator, like @dataclass, turns the class attributes into instance attributes) are analyzed through code inspection (by looking at the contents of MyClass.__annotations__). Only class constructors (the __init__(self, ...) methods) are analyzed through AST parsing for now, and it was quite tedious to detect all the forms of self.an_attribute = an_attribute assignments.
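For the curious, the tedious part looks roughly like this: a compact sketch of collecting self.x = ... assignments in a constructor, including the tuple-unpacking form that is easy to miss (the sample class and helper name are made up, not py2puml code):

```python
import ast

SOURCE = '''
class Point:
    def __init__(self, x: int, y: int = 0):
        self.x = x
        self.y, self.label = y, 'origin'
'''

def constructor_attributes(class_source: str) -> list[str]:
    """Collect instance attributes assigned to self in the class body."""
    attrs = []
    for node in ast.walk(ast.parse(class_source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                # flatten tuple targets like "self.y, self.label = ..."
                elts = target.elts if isinstance(target, ast.Tuple) else [target]
                for elt in elts:
                    if (isinstance(elt, ast.Attribute)
                            and isinstance(elt.value, ast.Name)
                            and elt.value.id == 'self'):
                        attrs.append(elt.attr)
    return attrs

print(constructor_attributes(SOURCE))  # ['x', 'y', 'label']
```

And this still ignores annotated assignments (self.x: int = x), augmented assignments, starred targets, assignments in helper methods called from __init__, etc., which is where the tedium comes from.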

We'll see :smiley: