marshmallow-code / marshmallow

A lightweight library for converting complex objects to and from simple Python datatypes.
https://marshmallow.readthedocs.io/
MIT License
7.01k stars 627 forks source link

Empty strings as None (revisited) #713

Open csingley opened 6 years ago

csingley commented 6 years ago

I'd like to revisit this suggestion - allowing empty strings to be interpreted as None:

https://github.com/marshmallow-code/marshmallow/issues/543

I understand the sentiment of the response: "it's a bad idea to support that". However, that's not sufficient for some use cases.

For example, right now I'm trying to use marshmallow to convert an XML format where null values are sent as empty strings. I don't get to control the XML format.

I think it's entirely reasonable for a Field to interpret a null datestring (or a null decimal string) as None, and control via allow_none whether to pass that. I don't think it's reasonable to fall down and say "the NULL value defined by your format is invalid, and we're not going to allow you to express that". This just forces a lot of ugly preprocessing workarounds onto library users, who need to clutter their code with superfluous, bug-prone value = value or None type declarations... thereby mixing format definition code in the middle of application code, which makes it hard to use marshmallow to write good code for these use cases.

I feel that the right place to define a format is, well, in the Schema class definition. A patch to accomplish this is simple:

We can stop there, and allow users to support their nasty unhygienic formats by overriding null_values in a custom subclass... or even take it a step further and allow first-class support for text-only formats by directly setting null_values = {None, ''} for fields.Decimal, fields.DateTime, fields.Date, fields.Time, etc. By Zeus, maybe even fields.Integer could treat empty strings as None!!

I'm happy to submit a PR to that effect, but I'd like to see where the developers' heads are at vis-a-vis using marshmallow for non-JSON formats before wasting anybody's time.

Thanks, and happy new year!

sersorrel commented 6 years ago

This would be really nice for parsing e.g. dates from HTML forms, where an empty <input type=date> (or any empty <input>) results in the empty string.

mgd722 commented 6 years ago

Seconded. This would be pretty handy for parsing empty <input> fields.

deckar01 commented 6 years ago

I think it makes sense for marshmallow to support this use case. We should explore marshmallow's existing functionality before we add a feature to the core.

Your use case:

The general solution for customizing deserialization behavior should be a custom field.

http://marshmallow.readthedocs.io/en/latest/custom_fields.html

from marshmallow import fields

class CustomField(fields.String):
    def _deserialize(self, value, attr, obj):
        if value == '':
            return None
        return super(CustomField, self)._deserialize(value, attr, obj)

Is there anything about this that doesn't work for your use case?

csingley commented 6 years ago

Writing custom field classes would work, of course... but if I have to write a custom Field subclass for every single data type, I'll just go ahead and roll my own type conversion/validation module and save myself the import. Indeed, that's what I ended up doing for the project where I was evaluating marshmallow.

It's a bit of a shame, because marshmallow has somef features that make it play nice with ElementTree (by accident I suppose). This is a strength that could be capitalized upon.

Taking a step back - the meaning of an empty string is defined globally by the protocol and/or parsing library.... "emtpy string is NULL" is a common choice. To map that protocol into Python data types, you want to write that equivalence once, globally... not write the same boilerplate ten times, once per validator. That's most inelegant.

Setting my use case aside, you've got users saying "hey wait a minute, HTML forms also use 'empty string is NULL' for lots of different data types". To me that sounds like a fairly key use case for marshmallow, and not handling it in the library seems like a bit of an oversight.

Just a friendly $0.02, because I see a lot to like in marshmallow. I probably would have used it for my project if this feature existed.

deckar01 commented 6 years ago

I don't think it is unreasonable to create a wrapper class for each of the fields you need to customize. It should be possible to create a mixin to make this clean and maintainable.

from marshmallow import fields

class NoneMixin(object):
    def _deserialize(self, value, attr, obj):
        if value == '':
            return None
        return super(NoneMixin, self)._deserialize(value, attr, obj)

class CustomStringField(NoneMixin, fields.String):
    pass

class CustomDecimalField(NoneMixin, fields.Decimal):
    pass

class CustomDateField(NoneMixin, fields.Date):
    pass

...

Would this work for the empty HTML input use case?

lafrech commented 6 years ago

The custom field described by @deckar01 only needs to be created once. Then, in the app, it can be used in place of the String field.

Agreed, this is far from perfect.

I think @csingley's proposal makes sense.

Wouldn't it be better, however, to treat empty strings as missing rather than None. I'm thinking of the HTML form use case, for instance.

  def deserialize(self, value, attr=None, data=None)
+     if value in self.null_values:
+         value = missing
        self._validate_missing(value)
    ...

Should this be used only on deserialization or on serialization as well? See this discussion about removing nullish values from output: https://github.com/marshmallow-code/marshmallow/issues/229.

csingley commented 6 years ago

I've got no skin in the game at this point, but from a user's point of view I think the mixin proposal is pretty reasonable (although not surprisingly I prefer my own approach... who doesn't?) Maybe a documentation enhancement outlining the NoneMixin pattern would be a good place to start?

@lafrech - re: missing vs None - that would work fine for my specific case, but these two things can sometimes require different handling. If the data protocol changes, and a new field is added, you might want to have your code do one thing if the field is missing (old protocol) vs. blank (new protocol, default value). Obviously this doesn't really apply to web forms, but I'd imagine in general you'd want to discriminate between the case where the field is empty and the case where it's simply not sent in the data stream.

Anyway, that's enough back seat driving from me! Cheers, and thanks for your work.

deckar01 commented 6 years ago

@anowlcalledjosh Did you come up with a solution for your use case?

@mgd722 Would a custom field be an acceptable solution for your use case? https://github.com/marshmallow-code/marshmallow/issues/713#issuecomment-378125776 or https://github.com/marshmallow-code/marshmallow/issues/713#issuecomment-378247878

sersorrel commented 6 years ago

@deckar01 I ended up using the following on my schemas:

@pre_load
def allow_null_dates(self, item):
    item["date"] = item["date"] or None
    item["enddate"] = item["enddate"] or None

Having to copy-paste subclasses seems a bit boilerplate-y, and makes integration with libraries like Marshmallow-SQLAlchemy (which automatically generates Marshmallow schemas) harder. Maybe for strings that's acceptable, but since the empty string isn't a valid date, I think dates should certainly turn "" into None.

mgd722 commented 6 years ago

Yes, a custom field could work for me but I do agree that it's rather inelegant. I ended up going with something like @anowlcalledjosh above to solve my issue. Since an empty string will never be a valid date, I'll second that (at least for dates) "" should end up as None.

DrPyser commented 5 years ago

A general solution could be to add a parameter to Field(available on all field types) to customize what input values are to be considered "null" or "missing".

schettino72 commented 5 years ago

Before finding this issue... I just created a fields.Integer subclass that has an attribute blank_as_none. It simple simple to implement but annoying as it needs to be done for all Numeric and Date types.

I can work on patch if there is interest on that..

schettino72 commented 5 years ago

Please see my proposed solution in the PR #1165. The implementation was pretty trivial and I guess it solves a pain point for many of us...

antoine-gallix commented 5 years ago

A demo on the proposed custom field solution.

from marshmallow import Schema, fields

# ---------------------Baseline---------------------

class SimpleSchema(Schema):
    s = fields.String()

simple_schema = SimpleSchema()

print(simple_schema.load({'s': None}))
# > UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})

print(simple_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': ''}, errors={})

# ---------------------empty as None, reject None---------------------

# Now trying to deserialize empty string as None, and don't accept None

class NotEmptyString(fields.String):
    """A String field type where empty string is deserialized to None"""

    def _deserialize(self, value, attr, data):
        if value == '':
            return None
        return super()._deserialize(value, attr, data)

class NotEmptySchema(Schema):
    s = NotEmptyString()

not_empty_schema = NotEmptySchema()

print(not_empty_schema.load({'s': None}))
# UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})

print(not_empty_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': None}, errors={})
# Value is translated to None, but it somehow breaks the process and result in a valid None

# ---------------------empty as None, accept None---------------------

# Now trying to deserialize empty string as None, but accept None

class NotEmptyButNoneOkSchema(Schema):
    s = NotEmptyString(allow_null=True)

not_empty_but_none_ok_schema = NotEmptyButNoneOkSchema()

print(not_empty_but_none_ok_schema.load({'s': None}))
# UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})
# Here we even have regression. The custom field even overrides allow_null
# option. Not what I want.

print(not_empty_but_none_ok_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': None}, errors={})
# Here is what I wanted

It seems that the allow_null option is applied before the deserialization. Making it impossible to work after a custom field that deserializes as None in some cases.

antoine-gallix commented 5 years ago

Basically, it's seems impossible to obtain the following behaviors:

charlax commented 5 years ago

This has been a huge source of frustration for me.

For instance, I'm trying to have an email field that allows None and "" (empty string). I believe an API should be liberal in accepting its input. If a field is optional, then a missing field, a null value or an empty string value should be considered legit.

So I'm trying to have this behavior throughout my codebase. In particular with the email fields, which by default does not allow it. This is what I had to do to allow empty emails:

class Email(fields.Email):
    def __init__(self, *args, **kwargs):
        kwargs.setdefault("allow_none", True)
        super().__init__(self, *args, **kwargs)

    def _validate(self, value):
        if not value:
            return
        super()._validate(value)

    def _validated(self, value):
        # This actually does nothing
        if not value:
            return value
        return super()._validated(value.strip().lower())

    def _deserialize(self, value, attr, obj):
        if not value:  # nocov
            return value
        # Strip and lowercase the email
        # NOTE: theoretically, the username part of email is not case
        # sensitive, but in practice, nobody cares about it (Gmail for
        # instance)
        return value.strip().lower()

(not mentioning the fact that overriding private methods feels a hack, but that's the pattern chosen by the library - IMO serialize and _serialize should have been inverted)

It would be great to have a better story around empty (missing, None or "") values for fields. This is very difficult when validators are involved.

def deserialize(self, value, attr=None, data=None, **kwargs):
        # Validate required fields, deserialize, then validate
        # deserialized value
        self._validate_missing(value)
        if value is missing_:
            _miss = self.missing
            return _miss() if callable(_miss) else _miss
        if getattr(self, "allow_none", False) is True and value is None:
            return None
        output = self._deserialize(value, attr, data, **kwargs)
        self._validate(output)
        return output

The problem in deserialize is that we don't have a way to bail out of validation and redefine value (like serialize does with its get_value).

lafrech commented 5 years ago

(not mentioning the fact that overriding private methods feels a hack, but that's the pattern chosen by the library - IMO serialize and _serialize should have been inverted)

serialize is meant to be called by another class, so it is "public".

_serialize is underscored because it is only meant to be called by its own class. It can be called and even overridden by subclasses. Hence the underscore meaning "protected".

"private" variables are double-underscored.

https://radek.io/2011/07/21/private-protected-and-public-in-python/ https://stackoverflow.com/questions/1641219/does-python-have-private-variables-in-classes

sloria commented 5 years ago

Let's revisit this issue post-3.0 release. I do think we want to have a story for validating empty strings.

taion commented 5 years ago

See also discussion in Django docs per https://docs.djangoproject.com/en/2.2/ref/models/fields/#field-options.

It would feel weird to deal with this in e.g. SQLAlchemy, though, so some sort of unification of dealing with empty/null logic in strings seems reasonable in Marshmallow.

sloria commented 5 years ago

I think it would be odd to treat empty strings equivalent to None; there are use cases where None means "not provided", as distinct from "".

I think we should control None and empty/blank values separately, as in Django. We could add an allow_empty parameter to Field and have empty_values set on the field class.

class Field(...):
    empty_values = {""}
    default_error_messages = {
       ...
        "empty": "Field may not be empty.",
        "null": "Field may not be null.",
    }

class List(...):
    empty_values = ([], ())

class Dict(...):
    empty_values = ({}, )

Thoughts?

sloria commented 5 years ago

You can achieve the above today with

from marshmallow import Schema, fields, validate

not_empty = validate.Length(min=1, error="Field may not be empty.")

class ArtistSchema(Schema):
    name = fields.Str(validate=not_empty)
    album_names = fields.List(fields.Str(), validate=not_empty)

ArtistSchema().load({"name": "", "album_names": []})
# marshmallow.exceptions.ValidationError: {'album_names': ['Field may not be empty.'], 'name': ['Field may not be empty.']}

So on one hand, the proposed API is redundant. On the other hand, it's a common enough use case that it may warrant first-class support.

sloria commented 5 years ago

Actually, I think I was misunderstanding the OP. IIUC, the goal is to interpret "" as missing and deserialize to None.

I think a more direct and flexible way to achieve this would be to allow specifying the values that should be interpreted as missing (as suggested by @lafrech in a previous comment).

# proposed API

class ArtistSchema(Schema):
    # if input is None, "", or is missing from the input, deserialize to None
    name = fields.Str(missing=None, missing_values={None, ""})

print(ArtistSchema().load({"name": ""}))  # {'name': None}

If instead your backend models "" as missing, you just change the value of missing.

class ArtistSchema(Schema):
    name = fields.Str(missing="", missing_values={None, ""})

This would allow any field to define what it means to be required. For example:

class ArtistSchema(Schema):
    album_names = fields.List(fields.Str(), required=True, missing_values=([], ()))

ArtistSchema().load({"album_names": []})
#  ValidationError: {'album_names': ['Missing data for required field.']}

What do you think?

schettino72 commented 5 years ago

I think a more direct and flexible way to achieve this would be to allow specifying the values that should be interpreted as missing

Indeed this approach would be more flexible. The problem is that it is very verbose. As OP work with XML or working with HTML forms, it is very common for blank string to used as missing value... In this solution you would need to add missing="", missing_values={None, ""} to all numeric and date fields, thats a lot of noise.

I have not even seen another use-case where people want other custom missing values, so maybe just dealing with the most common case would be better.

Another approach I would consider would be to specify missing_values at a global or schema level. This way it would be flexible as you wish and also reduce verbosity.

sloria commented 5 years ago

As OP work with XML or working with HTML forms, it is very common for blank string to used as missing value

Yes, it is common to model "" as missing, but there are also cases where None is used for missing (IIUC, this is what the OP wants).

See the relevant Django docs:

... it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL. One exception is when a CharField has both unique=True and blank=True set. In this situation, null=True is required to avoid unique constraint violations when saving multiple objects with blank values.

The proposed API cleanly supports both cases.


Another approach I would consider would be to specify missing_values at a global or schema level. This way it would be flexible as you wish and also reduce verbosity.

We could set defaults at the class level, so you could do

# proposed API

fields.Field.default_missing_values = ("", )

class ArtistSchema(Schema):
    name = fields.String(missing=None)
    dob = fields.DateTime(missing=None)

print(ArtistSchema().load({"name": "", "dob": ""}))
# => {'name': None, 'dob': None}

Edit: Use tuple instead of set for default_missing_values to avoid TypeErrors when checking against unhashable types.

csingley commented 5 years ago

Thanks for picking this up

Actually, I think I was misunderstanding the OP. IIUC, the goal is to interpret "" as missing and deserialize to None.

Yes, this is just it. There are plenty of data formats where the keys are fixed, and the values are all strings; in such a scheme, the only way to represent "no data" is an empty string.

Having the "no data" be configurable on the class is a win. Cheers.

sloria commented 5 years ago

@schettino72 What do you think about the default_missing_values solution above?

MarredCheese commented 5 years ago

@sloria The problem I see with that solution (if I'm understanding it correctly) is that it's incompatible with required=True. What if I want the following behavior?

schema.load({'some_field': None}) => {'some_field': None}
schema.load({'some_field': ''}) => {'some_field': None}
schema.load({}) => ValidationError: {'some_field': ['Missing data for required field.']}

The use case is differentiating between intentionally setting a field's value to null versus say accidentally botching a PUT request by leaving out a required field.

sloria commented 5 years ago

It's seems reasonable and common to represent "no data" in different ways ("", None, [], absence), but it seems less common (and perhaps unidiomatic) to have multiple ways to represent None. None and '' are not equivalent. If you really need '' to deserialize to None, perhaps you should use a custom field.

MarredCheese commented 5 years ago

@sloria At the database level, it is common to have string fields where you allow either NULLs or zero-length strings, but not both (to have a single, consistent way of representing "no data"). With that said, I would ask, if someone is going to bother using Marshmallow to convert some kind of input to a database-ready format, why wouldn't they consider having it automatically convert ZLS to NULL or vice versa? Why is that unidiomatic?

You correctly point out that None and '' are not always equivalent, but absence and None are also not always equivalent. Absence could mean "don't change the value," while None could mean "set the value to NULL" (e.g. a merge-style PATCH request). And as I stated above, absence could also be invalid and None could be intentional. Plus, Marshmallow's existing API already implies that absent and None are not always the same by allowing the argument combination required=True, allow_none=True.

I guess what I'm getting at is that I don't see the harm in accepting this PR (assuming it works) for times when Marshmallow users don't want to conflate absent and None.

In the meantime, if anyone reading this cares to see another idea, I tweaked this code above and came up with this:

@pre_load
def replace_empty_strings_with_nones(self, data, **kwargs):
    keys = ['project', 'model', 'inServiceDate']
    for key in keys:
        if key in data.keys():
            data[key] = data[key] or None
    return data
wschmrdr commented 4 years ago

I've been reading through this, as I've been having this issue myself, so there's some vested interest.

The argument was brought up that None and "" need to mean two different things. For something like String, or a similar type where you would expect "", this is understandable. However, we do need to also consider things like Integer, and other similar types. Even using the Number input type with standard HTML and AJAX in accordance with https://www.w3schools.com/html/tryit.asp?filename=tryhtml_input_number , if you submit nothing in the input box, the parameter will have a value of "". This creates a "Not a valid integer." validation error message in Marshmallow, when one would expect this to be interpreted as if it were None.

For the time being, I have a workaround on String of using a custom type, but it's something that would be nice to include as an option.

BTW, as a note to the last comment: You really want to be careful about doing things that way, especially with integers, since 0 and None should ALWAYS be two different things.

jrreich commented 3 years ago

I've been reading through this, as I've been having this issue myself, so there's some vested interest.

The argument was brought up that None and "" need to mean two different things. For something like String, or a similar type where you would expect "", this is understandable. However, we do need to also consider things like Integer, and other similar types. Even using the Number input type with standard HTML and AJAX in accordance with https://www.w3schools.com/html/tryit.asp?filename=tryhtml_input_number , if you submit nothing in the input box, the parameter will have a value of "". This creates a "Not a valid integer." validation error message in Marshmallow, when one would expect this to be interpreted as if it were None.

For the time being, I have a workaround on String of using a custom type, but it's something that would be nice to include as an option.

BTW, as a note to the last comment: You really want to be careful about doing things that way, especially with integers, since 0 and None should ALWAYS be two different things.

Did you ever find a good approach to handling empty strings as ints? This is exactly my current issue and a bit surprised to find there isn't a straightforward way to interpret "" as 0. Perhaps I'll look at subclassing fields.Integer but then since I'm trying to use it with SQLAlchemySchema I imagine I would have to overwrite the Integer Class.

wschmrdr commented 3 years ago

You don't necessarily want to interpret "" as 0, because 0 is a legitimate value that one could enter, whereas None, which is interpreted as "" in this case, means there is no value, and you definitely want to distinguish between the two. If you want 0 as a default value, that's a different story.

To answer your question of what I did, I made a MyInteger schema type that handles this accordingly, and then attach it to the attribute in my ModelSchema class (I'm using Flask and SQLAlchemy in addition to marshmallow). I usually only use it if the integer is nullable, as the error occurs at the DB upsert (update or insert), and if it's empty the required validator would usually catch it. In the _serialize and _deserialize methods, just check for "" or None, and make it a None; otherwise return the super, which should be able to handle "not valid" values, like "asdf". You don't need to overwrite the entire Integer class for your model, as this is happening when you load or dump the object, and SQLAlchemy knows how to deal with None just fine.