keleshev / ask-me-about-api-design

Me giving you advice on API design in your open-source project: create an issue to start discussion
8 stars 0 forks source link

(Re)implementing data validation vs. integrating with 3rd party libs? (marshmallow) #1

Open sloria opened 9 years ago

sloria commented 9 years ago

Project

https://github.com/sloria/marshmallow , a Python library for serializing and deserializing Python objects. Its primary use case is in building RESTful APIs, where the developer needs to validate client data, deserialize to an app-level structure, and serialize objects to JSON.

Problem

The issue I'm facing is detailed here: https://github.com/sloria/marshmallow/issues/85, but here's a summary:

Marshmallow allows users to pass custom validators to schema fields:

def validate_password(val):
    if len(val) < 6:
        raise ValidationError('Password is too short')

password = fields.String(validate=validate_password)

A user suggested that marshmallow include implementations of generic validators (e.g. Length, Range, etc.). My concern is that these validators have been implemented in many other libraries (including schema!). I'm wondering if it would be better to integrate with those libraries rather than include bespoke validators within marshmallow.

The arguments against integrating with 3rd-party libs are summarized by these comments:

https://github.com/sloria/marshmallow/pull/84#issuecomment-65934045

...every library raises its own exception and, in general, does things in a slightly different way. Moral of the story: we have to provide at least a minimum set of generic essential validators.

https://github.com/sloria/marshmallow/issues/85#issuecomment-65947596

We can't rely on third parties for this, validation is too important. It's not by chance that each library (e.g. WTForms, Colander, FormEncode, etc. etc.) implements them. I don't want to cope with their changes or bring them as a dependency of my project when the only thing I really want is to check the length of a simple string.

I would appreciate help in weighing the tradeoffs at play here.

keleshev commented 9 years ago

Hi Steven, thanks for asking!

Integrating with several validation libraries seems like a bad idea: even if you integrate with each of them, they will (most likely) not compose well. This could be confusing for users—why I can nest these two schemas, but not those two other schemas?

I would do one of the following:

  1. Pick a single 3rd-party library and support only it.
  2. Write your own validators as a part of marshmallow, maybe extracting them into a separate library in future.

If you already have a favorite validation library—pick (1), otherwise (2). I would most likely go with (2) since it would give more freedom and less friction (no need for pull-requests if you want to change the validation part).

This sums up what I think about your question. However, I have some more unsolicited advice.

Unsolicited advice

On schema validation

If you go with implementing your own validators, please try to make them as ordinary Python expressions if possible. Instead of this:

class ArtistSchema(Schema):
    name = fields.Str()

class AlbumSchema(Schema):
    title = fields.Str()
    release_date = fields.Date()
    artist = fields.Nested(ArtistSchema)

I would rather want to see something like this:

class ArtistSchema(Schema):
    name = Schema(str)

class AlbumSchema(Schema):
    title = Schema(str)
    release_date = Schema(date)
    artist = Schema(Artist)

I'm obviously biased here, because of my schema library, but the point is—people already know what str and date are—this makes it easier for the users of your library. They don't need to remember if it's fields.Str or fields.String or fields.String().

However, if you follow this, and use types as schemas, you would need to be able to link from type to it's schema. I would probably do that with some kind of artificial magic attribute:

class ArtistSchema(Schema):
    __schema_type__ = Artist
    name = Schema(str)

This way the relation between type and schema is more clear, and not implicitly based on make_object method. Then to find all schemas and their related types you could use Schema.__subclasses__().

Other minor advice

Being able to throw in a quick predicate lambda without need to raise ValidationError would be nice.

I would make schema objects singletons, so you don't need to instantiate them before use: UserSchema.load({'email': 'foo'}) instead of UserSchema().load({'email': 'foo'}). Instead of passing options to the constructor, I would make them class methods, like UserSchema.load_collection(...) instead of UserSchema(many=True).load.

Why not make required=True the default? I was assuming that fields are required right until I read in the docs that they are not. It's a similar problem with SQLAlchemy nullable=True default that is not obvious, and I've seen many people making mistakes because of that.

Also, I would make schemas always strict=True. I don't think that returning a tuple result, error is Pythonic. And there's no disadvantage of raising exception, as long as you provide a structured error together with exception message. Compare:

# Now
user, error = UserSchema().load(data)
if error:
    return error, 400
else:
    return user, 200
# Proposed
try:
    return UserSchema.load(data), 200
except ValidationError as error:
    return error.json, 400

Also, I wish there was a way for an attribute with one name to map to a JSON key with different name, like with SQLAlchemy (attribute name vs. column name):

class Thing(Base):
    created_at = Column('timestamp', DateTime)

Maybe it's possible, but I didn't look in the advanced-usage docs.

Another thing that makes me uncomfortable is that marshmallow is a lot of code, and has very big API surface.

Positives

On positive note, I really like that you can use plain objects with your schemas, and those objects don't need to know anything about serialization/validation—very elegant. Also I have never seen a library like that before—that combines together serialization and validation. I didn't think those two were related, but apparently they are—and you exploit this relationship very well. Marshmallow totally beats the need for as_json methods, and other ugly things that I'm committing myself regularly.

Disclamer

I didn't put much thought in naming things in this advice—like __schema_type__ and load_collection. These names need more thought, if you decide to follow those paths.

sloria commented 9 years ago

Thank you for the thorough response! I really appreciate that you went beyond just the original question.

Integrating with several validation libraries seems like a bad idea: even if you integrate with each of them, they will (most likely) not compose well.

I should clarify--I would not integate 3rd-party libraries within marshmallow itself; this would happen in a separate utility library. I drafted a proof-of-concept in this gist: https://gist.github.com/sloria/edffc11fac7ee680e544 . Basically, you'd have converter functions that would convert 3rd-party validators to marshmallow validators.

from wtforms.validators import NoneOf

foo = fields.Field(validate=[
        from_wtforms(NoneOf(['nil', 'null'])),
    ])

What I've decided for the next feature release is to rewrite the current validators as class-based validators to simplify their usage.

What is still unclear, however, is whether to expand the scope of the marshmallow.validate module beyond its current state.

...please try to make them as ordinary Python expressions if possible.

A good idea, and one that I've considered. And I suppose it could be done in backwards-compatible way (e.g. fields.Str() would be equivalent to Schema(str)). I'm wondering, though, if there are cases when a field/schema may not have a one-to-one mapping with a type. I recall a user implementing a polymorphic field. Another example is QuerySelect, which can handle multiple types (which may have no ancestry with each other).

Being able to throw in a quick predicate lambda without need to raise ValidationError would be nice.

This is possible:

password = fields.Str(validate=lambda p: len(p) > 8, error='Must be greater than 8 characters.')

I was considering deprecating this functionality, though, because a lambda validator isn't very useful without the error parameter, which adds a redundant point of API. Thoughts?

I would make schema objects singletons, so you don't need to instantiate them before use.

I'm not sure I like the idea of adding a new method for every option; that would add a lot of cognitive load, no?

Allowing multiple instances of a schema makes a schema much more reuseable. Need filtering in your HTTP API? Just pass request arguments to your Schema:

item = #...
requested_fields = request.query['fields']
schema = ItemSchema(only=requested_fields)
return schema.dump(item).data

Why not make required=True the default?

I can see how optional fields as the default might not meet user expectation. I will consider this.

I don't think that returning a tuple result, error is Pythonic.

Perhaps. My concern with raising an exception by default is that the user would then have to learn the interface of the exception (error.json, in your example) in order to get at the error messages. With a tuple, they just receive two dicts.

Also, I wish there was a way for an attribute with one name to map to a JSON key with different name, like with SQLAlchemy (attribute name vs. column name):

Here's how it's done in marshmallow:

created_at = fields.DateTime(attribute='timestamp')

marshmallow is a lot of code, and has very big API surface.

marshmallow weighs at ~1.3K LOC, which is within the ballpark of libraries of similar scope: colander is ~1.2K LOC, django-rest-framework's serializers + fields modules are at ~1.4K LOC.

Keeping marshmallow's API small and focused is a major concern, though, which is what brought me here in the first place. =)

Also I have never seen a library like that before—that combines together serialization and validation.

I can't take credit for the idea. marshmallow is heavily influenced by Django-REST-Framework, Flask-RESTful, and--to a slightly lesser extent--colander, WTForms, and schematics.