sqlalchemy / sqlalchemy

The Database Toolkit for Python
https://www.sqlalchemy.org
MIT License

Suggestion: Adding serialization to_csv(), to_json(), to_yaml(), to_dict() for all db models #4482

Closed AnderUstarroz closed 5 years ago

AnderUstarroz commented 5 years ago

@cmodzelewski-ms has done a pretty decent job with the package SQLAthanor, adding serialization methods to SQLAlchemy instances.

Nonetheless, I find it a bit dirty having to replace SQLAlchemy modules to implement it, when this should ideally be a native feature of SQLAlchemy. Wouldn't it be nice to merge both projects?

I am pretty sure thousands of programmers and API developers will be very grateful for this feature.

insightindustry commented 5 years ago

First off, thanks! I’m glad you like SQLAthanor.

Now to your bigger point: there's a pretty fundamental philosophical question as to whether it makes sense to merge SQLAthanor with SQLAlchemy.

On the one hand, merging them is a far cleaner and more convenient solution for long-term serialization/deserialization support in SQLAlchemy. On the other hand, keeping them separate keeps each package smaller and more focused.

Separation of concerns lets both packages focus on what they do well: SQLAlchemy provides DB connectivity (directly or via an ORM), and SQLAthanor provides serialization support.

Merging SQLAthanor's codebase into the SQLAlchemy codebase is definitely possible, but does it offer any advantages besides being a bit more convenient in the long run? If so, what are those advantages?

I'm really not certain whether it is worth giving up the benefits provided by separation of concerns in exchange for the convenience of having a single package. Are there any other advantages or trade-offs I may be missing or misunderstanding?

AnderUstarroz commented 5 years ago

Thanks for the quick answer. From a maintainer's point of view maybe you are right, and it is easier to maintain that way, but from a developer's point of view the advantages of the integration are evident: it would allow simply calling any_db_instance.to_json() without having to waste time searching for external packages and checking their reliability. For example, an application based on Django REST Framework performs all these JSON transformations without you having to waste a minute.

SQLAthanor replaces and extends SQLAlchemy's modules in order to add these features, but wouldn't it make more sense to have them out of the box within SQLAlchemy? It definitely wouldn't make sense for other serializers like Marshmallow or Colander, because they are framework-agnostic, but knowing that SQLAthanor was specifically (and uniquely) designed for SQLAlchemy, wouldn't that make it the perfect bride?

Maybe it would be a good idea to leave this thread open for a while, just to see what other developers/maintainers think about the matter.

zzzeek commented 5 years ago

Hi there -

unfortunately, merging any third party extension with SQLAlchemy is a total non-starter; SQLAlchemy's ongoing goal is to reduce the size of the library, not to expand it. To the degree any third party system introduces awkwardness due to the fact that it's not part of SQLAlchemy itself, that's a bug, either because SQLAlchemy isn't providing enough integration points or because the 3rd party library isn't using them.

Beyond that, I am the sole maintainer of SQLAlchemy, so anything that gets added to the project by definition becomes my job to maintain forever; even if we were looking to expand the library and go with a "batteries included" approach like Django's, that would not be sustainable with our current organization. Django is broken up into lots of separately maintained sub-components. We don't have an organization like that here to sustain that kind of thing, so external components need to be their own projects for now. At most, some projects might want to come into the new github.com/sqlalchemy organization, but that mostly means they're buying into a specific kind of development and testing model.

In this case, I took a brief look at https://sqlathanor.readthedocs.io/en/latest/using.html#import-sqlathanor to understand what @AnderUstarroz meant by "I find it a bit dirty having to replace SQLAlchemy modules to implement it", and in fact that's exactly what he means: SQLAthanor seems to provide its own imports that replace SQLAlchemy's, which I would assume is how it intercepts the creation of a model in order to determine how that model is shaped.

So, this is going to be blunt, and I apologize, but @AnderUstarroz is absolutely correct here, and the solution is not to merge SQLAthanor into SQLAlchemy; the solution is to rewrite SQLAthanor, or at least the part of it that builds its own versions of Column and relationship, and instead use SQLAlchemy's APIs for mapper inspection. There should be no need to import symbols that are wrappers for SQLAlchemy's versions of those objects. SQLAlchemy's existing objects should be as introspectable as possible, and there are also tons of event hooks that can be used to intercept when they are created and used. If SQLAthanor needs more introspection or event hooks, and they are appropriate, they can be added, or some other kind of hook that solves the problem in a different way if that turns out to be more appropriate.

Any serialization library for SQLAlchemy need only do one thing in order to build a schema from a model: use the inspection system to determine how a particular model is shaped. The inspection system can be used at the moment it is first needed, or it can be used for all new models as they are created, using the event hooks. The demo below illustrates one way to use both of these hooks together to build schemas for mapped classes as they are created, and then to serialize to JSON using those schemas.

Links:

Available inspection targets
InstanceState
InstanceState.attrs
mapper.attrs


import json

from sqlalchemy import event
from sqlalchemy import inspect
from sqlalchemy.orm import mapper

def _column_schema(expression):
    # for a column attribute, record the Python-level type of its column
    return expression.type.python_type

def _relationship_schema(relationship):
    # for a relationship, record whether it's a collection and its target mapper
    return (relationship.uselist, relationship.mapper)

def _create_schema_from_mapping(mapper, seen):
    if mapper in seen:
        return seen[mapper]
    seen[mapper] = schema = {}

    # this is kind of a made-up "schema" system, a real tool would do something
    # much more comprehensive
    for attr in mapper.column_attrs:
        schema[attr.key] = _column_schema(attr.expression)
    for attr in mapper.relationships:
        schema[attr.key] = _relationship_schema(attr)

    return schema

schemas = {}

# build a schema for each mapped class as soon as its mapper is configured
@event.listens_for(mapper, "mapper_configured")
def _pregen_schema_for_mapper(mapper, class_):
    _create_schema_from_mapping(mapper, schemas)

def model_to_dict(model):
    model_state = inspect(model)
    mapper = model_state.mapper
    schema = schemas[mapper]

    # again kind of a made up serializer showing how to kind of read from the
    # schema at the same time, a real tool would be more comprehensive
    result = {}
    for key, value in schema.items():
        attr_value = model_state.attrs[key].value
        if isinstance(value, tuple) and value[0]:
            # uselist relationship: recurse into each related object
            result[key] = [model_to_dict(elem) for elem in attr_value]
        elif isinstance(value, tuple):
            # scalar relationship: recurse into the single related object
            result[key] = model_to_dict(attr_value) if attr_value is not None else None
        else:
            # plain column value
            result[key] = attr_value
    return result

def model_to_json(model):
    # dump only at the top level so related objects nest as JSON objects
    # rather than as embedded JSON strings
    return json.dumps(model_to_dict(model), indent=2)

if __name__ == "__main__":
    from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
    from sqlalchemy.orm import relationship, Session
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class A(Base):
        __tablename__ = "a"

        id = Column(Integer, primary_key=True)
        data = Column(String)
        bs = relationship("B")

    class B(Base):
        __tablename__ = "b"
        id = Column(Integer, primary_key=True)
        a_id = Column(ForeignKey("a.id"))
        data = Column(String)

    e = create_engine("sqlite://")
    Base.metadata.create_all(e)
    s = Session(e)

    a1 = A(data="foobar", bs=[B(data="b1"), B(data="b2")])
    s.add(a1)
    s.commit()

    print(model_to_json(a1))

AnderUstarroz commented 5 years ago

Seems like we both were partially right @insightindustry:

This last point was actually the main reason why I suggested merging the packages: replacing SQLAlchemy modules just didn't feel right, and I thought that merging the packages would get rid of those hacky imports. Nonetheless, following Mike's advice and using SQLAlchemy's API/inspection would be a more transparent and cleaner solution.

Thanks for such an exhaustive comment @zzzeek, the level of detail of your answers never ceases to amaze me. Some of your reflections on Stack Overflow are still stuck in my head, and jaysus! that was more than 5 years ago 😄

zzzeek commented 5 years ago

OK so, what I left out was all those parameters SQLAthanor adds to Column and such, which I figure is the other reason the imports are like that. So yes, loading all those parameters into Column like this:


class User(BaseModel):
  __tablename__ = 'users'

  id = Column('id',
              Integer,
              primary_key = True,
              supports_csv = True,
              csv_sequence = 1,
              supports_json = True,
              supports_yaml = True,
              supports_dict = True,
              on_serialize = None,
              on_deserialize = None,
              display_name = None)

that's mixing concerns inside the object, adding a huge number of keyword arguments (what I sometimes call the "flags and switches" approach to API design), and it does not scale. All the objects we are dealing with here have a dictionary called .info on them, which is there precisely so that third party libraries can put whatever they want in it. As for API, it can be used directly, or after the fact. Directly looks like this:

Column('x', Integer, info={"my_key": "my_value"})

SQLAthanor won't want to do that. It should establish mutator functions that populate the dictionary with SQLAthanor-private state directly:

from sqlathanor import serializable

class MyClass(Base):
    #  ...

    data = Column(String)
    serializable(data, supports_json=True, supports_csv=True, csv_sequence=1)

    widgets = relationship("Widget")
    serializable(widgets, supports_json=True)

the serializable function looks like:

def serializable(obj, **kw):
    if "my_serializer" not in obj.info:
        obj.info["my_serializer"] = {}
    obj.info["my_serializer"].update(kw)

or you could go with a wrapper approach

from sqlathanor import serializable

class MyClass(Base):
    #  ...

    data = serializable(Column(String), supports_json=True, supports_csv=True, csv_sequence=1)

    widgets = serializable(relationship("Widget"), supports_json=True)
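
For the wrapper form, the helper itself would just apply the same .info state and then return the object it was given; roughly:

def serializable(obj, **kw):
    # wrapper variant of the mutator above: same .info bookkeeping, but
    # returning the Column / relationship so it can be assigned inline
    obj.info.setdefault("my_serializer", {}).update(kw)
    return obj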

I think the after-the-fact approach is cleaner. You can also set up all the serialization separately in the class def that way:

    class A(Base):
        __tablename__ = "a"

        id = Column(Integer, primary_key=True)

        data = Column(String)
        bs = relationship("B")

        serializable(data, supports_json=True)
        serializable(bs, supports_json=True)
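
To sketch how a serializer might later read that state back, here is one possible shape, reusing the made-up "my_serializer" key from above and the same mapper inspection calls as the earlier demo (a real tool would do more):

from sqlalchemy import inspect

def serializer_options(model_class):
    # collect the per-attribute options that serializable() stashed in .info;
    # column attributes carry them on the underlying Column object, while
    # relationships carry them on the RelationshipProperty itself
    mapper = inspect(model_class)
    options = {}
    for attr in mapper.column_attrs:
        info = attr.columns[0].info.get("my_serializer")
        if info:
            options[attr.key] = info
    for rel in mapper.relationships:
        info = rel.info.get("my_serializer")
        if info:
            options[rel.key] = info
    return options

# e.g. serializer_options(A) -> {"data": {"supports_json": True}, "bs": {"supports_json": True}}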

it seems like all of these supports_json, supports_yaml, etc. flags default to False (which seems like an odd choice, since part of the point of this is to automate the creation of the schema == less typing; why would I want to_yaml(my_object) to fail if I call it?), but that means someone is going to be typing lots of flags. I'd rather type a lot less, and do something like this:

    class A(Base):
        __tablename__ = "a"

        id = Column(Integer, primary_key=True)

        data = Column(String)
        x = Column(Integer)
        y = Column(Integer)
        bs = relationship("B")

        # with this version you can stick a sort_key in each attribute too
        serializable(
            data, x, y, bs, 
            supports_json=True
        )

or just class level, since I think that's how most people would want this to work for the vast majority of attributes (right?)

    @serializable(supports_json=True, supports_yaml=True)
    class A(Base):
        __tablename__ = "a"

        id = Column(Integer, primary_key=True)

        data = Column(String)
        x = Column(Integer)

        # let's add additional things to how we serialize the 'y' attribute
        y = Column(Integer)
        serializable(y, on_serialize=lambda val: val + 5, on_deserialize=lambda val: val - 5)

        bs = relationship("B")
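
To make those last two forms concrete, here's one rough way a combined helper could look; the variadic signature, the "my_serializer" key, and the _serializer_defaults attribute are all made up here purely for illustration:

def serializable(*objs, **kw):
    # called with Column / relationship objects: tag each of them via .info,
    # just like the single-object mutator sketched earlier
    if objs:
        for obj in objs:
            obj.info.setdefault("my_serializer", {}).update(kw)
        return None

    # called with keyword arguments only: act as a class decorator recording
    # class-wide serialization defaults, which a schema builder (e.g. in a
    # "mapper_configured" hook) would merge with the per-attribute .info state
    def decorate(cls):
        defaults = dict(getattr(cls, "_serializer_defaults", {}))
        defaults.update(kw)
        cls._serializer_defaults = defaults
        return cls

    return decorate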

there's a lot of python idioms that can be used here, I think minimization of repetition and composition over keyword arguments are good things to focus on.

note that for all the APIs I propose here, none of them would suggest "hey, maybe SQLAthanor should be merged into SQLAlchemy"; it's obvious that these two things are separate and there would be nothing to gain by merging the two APIs into one. That's the big problem with the "redefine the imports and add many new arguments" model: it blurs the lines and creates confusion.

insightindustry commented 5 years ago

Thanks, @zzzeek! I am completely in agreement with your arguments against any kind of merging between SQLAlchemy and SQLAthanor, and you articulated the reasons much more clearly than I did, so thanks for that.

In terms of your comments on SQLAthanor's API, I'm also deeply grateful for your suggestions there. I spent a looong time thinking about how to model an API that supports the practical nuances of serialization/de-serialization while being minimally invasive (or, ideally, not invasive at all) within SQLAlchemy's API, and that supports comprehensive serialization/de-serialization functionality with "minimal typing".

The real reason to intercept/extend SQLAlchemy’s code is to support exactly those keyword arguments which you identified. I wrestled looong and hard with the design decision of whether to do that, and you might have noticed that SQLAthanor actually supports two very different serialization configuration patterns: a declarative pattern (which intercepts and extends SQLAlchemy objects) and a meta pattern which is more robust/elegant/clean (and does not need to intercept anything / can be implemented so as to leverage mixins and meta models).

I thought long and hard about whether to support both approaches, and while the purist in me strongly prefers the meta approach, which does not need to do anything in terms of extending or modifying SQLAlchemy's APIs, ultimately I decided to strike a balance between an API that works conveniently for "quick-and-dirty" use cases (as is commonly the case in data science work) and one suited to more robust class design.

For more robust / elegant / clean design, I far prefer the “meta approach” to configuring serialization/deserialization at the class level as documented here.

Since lots of folks have asked why support both approaches, I’ve even included a brief explanation of my reasoning in the docs.

Generally, feedback that I’ve gotten from users suggests that the preference for declarative/meta approach depends heavily on use case (quick-and-dirty vs more robust) and there are plenty who prefer one over the other based on their style and specific situation.

Other things, like defaulting to supports_json = (False, False), seem counterintuitive for simple serialization use cases - but in practice, when designing applications, it is a handy security feature: one wouldn't want to serialize an (even encrypted) password attribute/column "by accident", for example. Therefore I opted to ensure that things get serialized/deserialized explicitly and never implicitly...unless using the by-definition-explicit dump_to_<format>() method.

I’m going to give a fair bit of thought to the serializable() pattern you suggest. There might be a good compromise there which allows for a cleaner design and which can minimize coupling with SQLAlchemy’s code (of course, migrating to a wholly new declarative approach will introduce backwards compatibility issues for existing users, which means that functionality would need to be deprecated over some amount of time before being cleaned out of the code base, but that’s a manageable concern).

Thanks for giving me a lot to think about! I’m deeply grateful for any other thoughts or suggestions you may have. And if at some point you do decide that a serialization/deserialization extension for SQLAlchemy would be useful, I’m happy to share any learnings as to the nuances of serialization/deserialization use cases (as with anything, the devil is in the details).

zzzeek commented 5 years ago

Ah right, if you default to True, then you have the security concern. Maybe a flag that marks fields as security-sensitive, agnostic of which serialization format is chosen, but either way, that's why this kind of API is very open to subjective judgment and should be its own thing.
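
Purely to illustrate that flag idea on top of the serializable() sketches from earlier (the "sensitive" flag and the "my_serializer" key are made up, not an existing API):

def serializable(obj, sensitive=False, **kw):
    # same .info mutator sketch as before, plus a hypothetical format-agnostic
    # "sensitive" flag that a serializer would honor for every output format
    obj.info.setdefault("my_serializer", {}).update(kw, sensitive=sensitive)

# usage in a class body:
#     password = Column(String)
#     serializable(password, sensitive=True)   # never serialized by default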

I think finding a midway between the "meta" and "declarative" configurations, and making that just the one way to configure, is a great idea. The "meta" config is how I normally do these things but I'd probably try to make it a little more succinct looking, so perhaps some of the ideas I've suggested hit that middle ground.

and of course, in all cases, if you change anything, you leave the old ways around for compatibility, but I would recommend following a deprecation schedule, as I'm having to take a lot of old things out of SQLAlchemy in 1.3 that I was too lazy to get around to over the years.