Baukebrenninkmeijer opened this issue 1 year ago
would you have a snippet of Python code describing a sample domain model involving SQLAlchemy, please? Something like this Pydantic example: https://docs.pydantic.dev/latest/usage/models/#nested-models.
That would help contributing to this issue.
Second question: would you agree on reviewing the corresponding pull request when it is done?
@lucsorel Yes, I'll create some tomorrow. Naturally, I will help with the PR and review it when done. Are you also willing to help out? I would appreciate it immensely, but please only do so if you have enough free time.
Thank you for your help, I appreciate it :pray:
For now, I am working with other people on #43. When it is done, I will be able to work on this one. I need someone to bootstrap me on SQLAlchemy (the sample domain model); then I'll be able to see what to do.
I maintain py2puml in my free time, which varies a lot from week to week.
Moreover, there have been some older requests that I should handle too. So no ETA for any feature, I am sorry :shrug:
@lucsorel Apologies it took a bit longer. You know how these things go.
Below, I've created a little snippet with 3 classes. It tries to have both a foreign-key relationship in a one-to-many fashion, as well as inheritance from an abstract base class.
```python
from sqlalchemy import Boolean, Column, ForeignKey
from sqlalchemy.ext.declarative import AbstractConcreteBase
from sqlalchemy.orm import declarative_base, relationship
from sqlalchemy.types import DateTime, Float, Integer, String

example_base = declarative_base()


class User(AbstractConcreteBase, example_base):
    user_id = Column(Integer, autoincrement=True, primary_key=True)
    username = Column(String(100))
    salted_pw = Column(String(100))
    created_at = Column(DateTime)
    last_login = Column(DateTime)
    age = Column(Integer)
    height = Column(Float)


class Login(example_base):
    __tablename__ = "logins"

    _id = Column(Integer, primary_key=True, autoincrement=True)
    login_timestamp = Column(DateTime)
    user_id = Column(Integer, ForeignKey('special_users.user_id'))


class specialUser(User):
    __tablename__ = 'special_users'
    __mapper_args__ = {"polymorphic_identity": "special_users", "concrete": True}

    special_id = Column(Integer, primary_key=True, autoincrement=True)
    admin_access = Column(Boolean)
    clearance_number = Column(Integer)
    logins = relationship('Login', backref='user')
```
:pray: thank you for the snippet.
If you are familiar with PlantUML code, can you draw a class diagram in the online editor and share the link (make sure that the URL to the diagram has been refreshed), please?
Honestly, I dread writing the parser that will handle such code :/
Several caveats come to mind:

1. which type should be displayed for an attribute like `User.username`: should it be `Column`, `Column[String]`, `String` or `str`?
2. the SQLAlchemy datatypes differ from the `builtins` ones (`int`, `bool`, `str`, etc.) and do not even belong to the same module (`sqlalchemy.Boolean`, `sqlalchemy.types.Integer`): which ones do we want to see in the class diagram? The SQLAlchemy datatypes or the native underlying ones? (I would go for the SQLAlchemy datatypes, for simplicity's sake)
3. some class attributes are specific to SQLAlchemy (`__tablename__`, `__mapper_args__`, maybe several others): should they be included in the class diagram or omitted?
4. relationships are declared like plain class attributes (such as `specialUser.logins`, which makes them hard to identify programmatically): should they be included in the class diagram or omitted?
5. the `sqlalchemy.orm.declarative_base` function is imported (which can be done in several different ways), then called and its result assigned to a variable, then this variable is used as a parent class of the domain model

This is not a simple feature you are asking for :worried:
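As a rough illustration of caveat 1, here is a naive AST heuristic (a sketch of my own, not py2puml code) that pulls the declared type name out of a 1.4-style `Column(...)` expression without executing anything; it only handles the `Column(Type(...))` and `Column(Type, ...)` shapes:

```python
import ast
from typing import Optional


def column_type_name(assignment_source: str) -> Optional[str]:
    """Return the name of the SQLAlchemy type passed to Column(...), if any."""
    node = ast.parse(assignment_source).body[0]
    if not isinstance(node, ast.Assign):
        return None
    call = node.value
    if not (isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "Column"):
        return None
    for arg in call.args:
        # Column(String(100)): the argument is a call to the type constructor
        if isinstance(arg, ast.Call) and isinstance(arg.func, ast.Name):
            return arg.func.id
        # Column(Integer): the argument is a bare type name
        if isinstance(arg, ast.Name):
            return arg.id
    return None


print(column_type_name("username = Column(String(100))"))  # String
print(column_type_name("age = Column(Integer)"))  # Integer
```

A real implementation would also need to skip non-type arguments like `ForeignKey(...)`, but it shows the general idea.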
@lucsorel
I've stumbled upon this project today (looking great, btw), and have the same problem as @Baukebrenninkmeijer. Here are my thoughts, hope they help.
AFAIK, this is the old (1.4) style of declaring models. This is still supported in version 2.0, but examples in the documentation seem to prefer the new style declaration.
I'd think 1.4 is still more often seen in the wild, but it might make sense to skip 1.4 style and immediately support the new declarative syntax. The new declarative API works via hinting native types, and is therefore much more amenable to inspection, which should make implementing this a lot easier.
That should at least take care of caveats 1 and 2. And re caveat 3, my personal opinion is that it would be nice, but it's much better to have a way to automatically generate a diagram without that information than to have no way at all.
For comparison, I've ported the sample to the newer 2.0 syntax.
```python
from datetime import datetime
from typing import Optional

from sqlalchemy import ForeignKey, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
from sqlalchemy.types import String


class ExampleBase(DeclarativeBase):
    pass


class User(ExampleBase):
    __abstract__ = True

    user_id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[Optional[str]] = mapped_column(String(100))
    salted_pw: Mapped[Optional[str]] = mapped_column(String(100))
    created_at: Mapped[Optional[datetime]] = mapped_column()
    last_login: Mapped[Optional[datetime]] = mapped_column()
    age: Mapped[Optional[int]] = mapped_column()
    height: Mapped[Optional[float]] = mapped_column()


class Login(ExampleBase):
    __tablename__ = "logins"

    _id: Mapped[int] = mapped_column(primary_key=True)
    login_timestamp: Mapped[Optional[datetime]] = mapped_column()
    user_id: Mapped[Optional[int]] = mapped_column(ForeignKey("special_users.user_id"))


class SpecialUser(User):
    __tablename__ = "special_users"
    __mapper_args__ = {"concrete": True}

    special_id: Mapped[int] = mapped_column(primary_key=True)
    admin_access: Mapped[Optional[bool]] = mapped_column()
    clearance_number: Mapped[Optional[int]] = mapped_column()
    logins = relationship("Login", backref="user")


if __name__ == "__main__":
    # check generated SQL with this
    engine = create_engine("sqlite://", echo=True)
    ExampleBase.metadata.create_all(engine)
```
Compared to the old style example, this produces identical SQL code as far as I can see. As a side note, concrete table inheritance as used in this example is usually discouraged, and more of an interesting edge case I'd argue.
Caveat emptor:
I don't know how py2puml handles imports, as I have never used it before, so I'm using a very naive approach for this. Adjust accordingly.
Also, I'm generally somewhat mediocre when it comes to UML, so please excuse if this is not ideal. For example, if there's a need to fully model the imports, I have absolutely no idea how to deal with metaclasses in UML.
```
@startuml
package sqlalchemy.orm {
    class DeclarativeBase
}

class ExampleBase

class User {
    user_id: int
    username: Optional[str]
    salted_pw: Optional[str]
    created_at: Optional[datetime]
    last_login: Optional[datetime]
    age: Optional[int]
    height: Optional[float]
}

class Login {
    _id: int
    login_timestamp: Optional[datetime]
    user_id: Optional[int]
}

class SpecialUser {
    special_id: int
    admin_access: Optional[bool]
    clearance_number: Optional[int]
    logins: Any
}

DeclarativeBase <|-- ExampleBase
ExampleBase <|-- User
ExampleBase <|-- Login
User <|-- SpecialUser
@enduml
```
This doesn't take care of properly typing relationships at all, and doesn't look for the `__abstract__` member to figure out that `User` is abstract.
```
@startuml
package sqlalchemy.orm {
    class DeclarativeBase
}

class ExampleBase

abstract class User {
    user_id: int
    username: Optional[str]
    salted_pw: Optional[str]
    created_at: Optional[datetime]
    last_login: Optional[datetime]
    age: Optional[int]
    height: Optional[float]
}

class Login {
    _id: int
    login_timestamp: Optional[datetime]
    user_id: Optional[int]
}

class SpecialUser {
    special_id: int
    admin_access: Optional[bool]
    clearance_number: Optional[int]
    logins: Login
}

DeclarativeBase <|-- ExampleBase
ExampleBase <|-- User
ExampleBase <|-- Login
User <|-- SpecialUser
SpecialUser o-- Login
@enduml
```
This correctly identifies the relationship, and that `User` is abstract. Getting the abstract classes right should be easy, since one only has to look for an `__abstract__` member that is set to `True`, and possibly `__mapper_args__` in the inheriting classes.
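As a quick sanity check, spotting that marker statically looks straightforward with the `ast` module (a minimal sketch of my own, assuming the marker is a literal `__abstract__ = True` assignment in the class body):

```python
import ast

# Hypothetical 2.0-style abstract model, as in the snippet above
ABSTRACT_SOURCE = '''
class User(ExampleBase):
    __abstract__ = True
    user_id: Mapped[int] = mapped_column(primary_key=True)
'''


def is_abstract_model(class_node: ast.ClassDef) -> bool:
    """True if the class body contains a literal `__abstract__ = True`."""
    for statement in class_node.body:
        if isinstance(statement, ast.Assign):
            names = [t.id for t in statement.targets if isinstance(t, ast.Name)]
            if "__abstract__" in names:
                return isinstance(statement.value, ast.Constant) and statement.value.value is True
    return False


print(is_abstract_model(ast.parse(ABSTRACT_SOURCE).body[0]))  # True
```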
I guess implementing the relationship typing would be doable. `SpecialUser.logins` is a relationship, so `SpecialUser.logins.property` has a type of `Relationship`.
Getting the association right would be quite hard though. You'd have to remember the `__tablename__` for every class, and then take a look at e.g. `SpecialUser.logins.property.local_remote_pairs`, and try to go from there.
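Finding the relationship attributes themselves looks tractable statically, though: an assignment whose right-hand side is a call to `relationship(...)` with a string literal as its first argument names the target class. A naive sketch of my own (it ignores the `Mapped[...]`-annotated form and, as said above, does not resolve the actual join columns):

```python
import ast

# Hypothetical class body, as in the 1.4-style snippet above
RELATIONSHIP_SOURCE = '''
class SpecialUser(User):
    __tablename__ = "special_users"
    logins = relationship("Login", backref="user")
'''


def relationship_targets(class_node: ast.ClassDef) -> dict:
    """Map attribute names to the class named in relationship('...') calls."""
    targets = {}
    for statement in class_node.body:
        if not isinstance(statement, ast.Assign):
            continue
        value = statement.value
        if (isinstance(value, ast.Call)
                and isinstance(value.func, ast.Name)
                and value.func.id == "relationship"
                and value.args
                and isinstance(value.args[0], ast.Constant)):
            for target in statement.targets:
                if isinstance(target, ast.Name):
                    targets[target.id] = value.args[0].value
    return targets


print(relationship_targets(ast.parse(RELATIONSHIP_SOURCE).body[0]))  # {'logins': 'Login'}
```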
Another potentially easier solution for this Issue would be to add an option to not follow imports, like mypy, but I don't know how feasible that is, given py2puml's design.
I'd be willing to spend some time to help work on this, but honestly, I don't think I'd be much help implementing anything on the py2puml side of things. So if there's some groundwork I could do, or another way I could assist to make this easier for you, I'd be happy to help.
@FynnFreyer Thanks! Yes, this is exactly the elaboration needed, and is extremely useful!
In terms of skipping all <2.0 versions, I also considered that, but given that most codebases will not have migrated and will use an older version, my preference would be to see if we can support both. With the new 2.0 mapper system, things get easier using type hinting, but extracting a type from an SQLAlchemy object was fairly easy before as well, since it had a mapping to Python types (i.e. `model.__table__.columns[0].type.python_type`).
The diagram I would have created is the same as the one you created above, @FynnFreyer. BTW, I copied the dunder attributes from what we are using at work - I'm not entirely sure whether they are required, such as the mapper args.
@Baukebrenninkmeijer
I fear you're right. Version two is quite new, and SQLAlchemy quite old, so only supporting 2.0 might not cut it in a lot of cases.
Using `model.__table__.columns[i].type.python_type` seems like a pretty good solution here, and works equally well with both kinds of usage, I just checked. I didn't see this attribute before, but given that it exists, I don't think only supporting version 2.0 is a big improvement over just using it.
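If py2puml ends up avoiding code execution altogether, a static fallback for `python_type` could be a plain lookup table keyed on the type names from `sqlalchemy.types`. A hypothetical sketch of my own, deliberately incomplete; the native names below are what `python_type` returns for these common types:

```python
# Static stand-in for `column.type.python_type`, usable from AST parsing alone.
# Hypothetical table, covering only common sqlalchemy.types names.
SQLALCHEMY_TO_PYTHON = {
    "Integer": "int",
    "BigInteger": "int",
    "SmallInteger": "int",
    "Float": "float",
    "Numeric": "Decimal",
    "Boolean": "bool",
    "String": "str",
    "Text": "str",
    "Unicode": "str",
    "Date": "date",
    "DateTime": "datetime",
    "Time": "time",
    "LargeBinary": "bytes",
}


def native_type(sqlalchemy_type_name: str) -> str:
    """Fall back to the SQLAlchemy name when no native equivalent is known."""
    return SQLALCHEMY_TO_PYTHON.get(sqlalchemy_type_name, sqlalchemy_type_name)


print(native_type("String"))  # str
```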
I guess the biggest conceptual issue left would be point 5, programmatically identifying that this is an SQLAlchemy domain model. A quick and dirty way would be to check whether `model.__class__.__name__` evaluates to the metaclass that SQLAlchemy models use (`'DeclarativeAttributeIntercept'`).
One more thing I just noticed: you'd have to check `model.__table__.columns[i].nullable` as well, to identify whether this is an optional value or not.
Thank you @Baukebrenninkmeijer and @FynnFreyer for the thorough description of your need and hints to a solution :pray:
However, most of the hints you gave rely on code inspection (actually loading the module, meaning that the Python interpreter actually runs it; identifying the classes; checking some class attributes like `model.__table__.columns[i].type.python_type`, `model.__table__.columns[i].nullable`, `model.__class__.__name__`) and performing some heuristics to filter in only the relevant attributes (the ones which hold data) and to filter out the ones related to relationships (which would be replaced by inheritance or association UML arrows).
Code inspection is a great and powerful feature of Python, and it is what I used when I started this tool because it was so convenient. But it is also painful and risky (see https://github.com/lucsorel/py2puml/issues/59#issuecomment-1720155485: the 3rd bullet mentions that the Python interpreter actually runs the code, and some unexpected side effects can happen, like writing to a database or removing files, etc. - I don't want to be responsible for that). I had the idea of removing code inspection and relying only on abstract syntax tree (AST) parsing to do the job, but it was a contributor who actually created the #59 issue to get rid of code inspection.
Some downsides of AST parsing are that it is harder to implement heuristics and py2puml features, and that py2puml will miss class attributes added dynamically (by decorators, for example). But as a maintainer, I would feel safer knowing that py2puml does not execute the code that it inspects.
It seems to me that extracting the domain model from an SQLAlchemy model should be feasible with AST parsing, but with different heuristics relying on type annotations (SQLAlchemy 2+) or on the expression assigned to the class fields (SQLAlchemy 1.4); what do you think?
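For the SQLAlchemy 2+ side, a first feasibility check of that annotation-based heuristic (a sketch of my own, not py2puml code): `ast.AnnAssign` nodes whose annotation is `Mapped[...]` carry the field type directly.

```python
import ast

# Hypothetical 2.0-style model, as in the snippet above
MODEL_SOURCE = '''
class Login(ExampleBase):
    __tablename__ = "logins"
    _id: Mapped[int] = mapped_column(primary_key=True)
    login_timestamp: Mapped[Optional[datetime]] = mapped_column()
'''


def mapped_fields(class_node: ast.ClassDef) -> dict:
    """Map field names to the type written inside Mapped[...]."""
    fields = {}
    for statement in class_node.body:
        if not isinstance(statement, ast.AnnAssign):
            continue
        annotation = statement.annotation
        if (isinstance(annotation, ast.Subscript)
                and isinstance(annotation.value, ast.Name)
                and annotation.value.id == "Mapped"
                and isinstance(statement.target, ast.Name)):
            # ast.unparse (Python 3.9+) turns the inner annotation back into source
            fields[statement.target.id] = ast.unparse(annotation.slice)
    return fields


print(mapped_fields(ast.parse(MODEL_SOURCE).body[0]))
# {'_id': 'int', 'login_timestamp': 'Optional[datetime]'}
```

Dunder assignments like `__tablename__` are plain `ast.Assign` nodes, so they are skipped for free.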
@lucsorel These are 100% fair considerations, and I think it should be doable with just the AST as well. Given that for 1.4 the type is mentioned in the `Column` expression, and for 2.0 it is expressed as a quasi-normal Python type hint, do you think it will be significantly more complex than parsing other types of code? Compared with Pydantic, for example: they also declare the type as a type hint, similarly to SQLAlchemy 2.0, so my expectation is that similar approaches can be leveraged.
In terms of sequentiality and effort, I'd start with the 2.0 implementation, given my additional expectation that it will be easier since it uses type hints. We will also, in the (hopefully) near future, migrate to 2.0, so maybe that should just be the way to go.
For additional future-proofing, support for 1.4 would naturally be nice, but I think this is, given the circumstances, the better approach forward.
> do you think it will be significantly more complex than parsing other types of code? Compared with pydantic for example?
It should not be very different; but for now, such classes (where an abstract class or a decorator - like `@dataclass` - turns the class attributes into instance attributes) are analyzed through code inspection (by reading the contents of `MyClass.__annotations__`). Only class constructors (the `__init__(self, ...)` functions) are analyzed through AST parsing for now, and it was quite tedious to detect all the types of `self.an_attribute = an_attribute` assignments.
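For context, the kind of constructor analysis mentioned above can be approximated like this (a much-simplified sketch of my own; the real implementation has to cope with many more assignment shapes):

```python
import ast

# Hypothetical plain class with a constructor
CONSTRUCTOR_SOURCE = '''
class Point:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y
        self.norm_squared = x * x + y * y
'''


def instance_attributes(init_node: ast.FunctionDef) -> list:
    """Collect the names assigned to `self.<name>` in a constructor body."""
    names = []
    for node in ast.walk(init_node):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if (isinstance(target, ast.Attribute)
                        and isinstance(target.value, ast.Name)
                        and target.value.id == "self"):
                    names.append(target.attr)
    return names


init_node = ast.parse(CONSTRUCTOR_SOURCE).body[0].body[0]
print(instance_attributes(init_node))  # ['x', 'y', 'norm_squared']
```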
We'll see :smiley:
Issue to add support for SQLAlchemy models. This will include considerations about 1-1, 1-* and *-* relationships. Additionally, it should also consider inheritance.