numberoverzero / bloop

Object Mapper for DynamoDB
http://bloop.readthedocs.io
MIT License
58 stars 12 forks source link

Support Nullable strings #118

Closed trondhindenes closed 6 years ago

trondhindenes commented 6 years ago

As far as I understand, supporting None (empty) strings currently requires creating a custom type class. It would be awesome if bloop supported this natively. Feels to me that nulling a string is such a common thing that while Bloop is a super-flexible library it would be awesome to have it builtin.

I'm afraid I don't understand enough of the type system in bloop to be able to properly define a nullable type.

numberoverzero commented 6 years ago

Thanks for opening an issue. To make sure I understand, what would the equivalent call be in boto3 to either PutItem or UpdateItem?

I remember DynamoDB didn't support empty strings for a while, so I'd even less expect a typed {"S": null} to be supported:

In that last link, there's this comment on the String type:

Strings are Unicode with UTF-8 binary encoding. The length of a string must be greater than zero and is constrained by the maximum DynamoDB item size limit of 400 KB.

trondhindenes commented 6 years ago

We solve it using a custom NullableString type:

class NullableString(String):
    def __init__(self):
        super().__init__()

    def dynamo_dump(self, value, *, context, **kwargs):
        if value is None:
            return None
        elif value == '':
            return None
        return value

    def dynamo_load(self, value, *, context, **kwargs):
        if value is None:
            return None
        elif value == '':
            return None
        value = super().dynamo_load(value, context=context, **kwargs)
        return value

This seems to work, but I have to admit I haven't really read up on the underlying DynamoDB reference, and we haven't done extensive testing of the above code yet either.

numberoverzero commented 6 years ago

That's pretty good, and it works as long as you are comfortable treating empty string "" and lack of value None as the same. In some cases, this isn't a safe assumption (eg. migrating from a nullable column to non-null user-provided field that may be an empty string).

Bloop handles the distinction by treating anything that's set to None as an explicit instruction to clear that column in DynamoDB during the UpdateItem call (engine.save) and other types, like Set, use this to coerce an empty set to None when saving the item. There's a brief description of this here but that section could stand to be expanded.

Right now I believe Bloop will fail if you try to save an empty string (throwing a boto3 or botocore exception) because the String type won't automatically translate "" to None. If it did that, Bloop would do what you're expecting and clear the column when the string value is empty locally.

I'd recommend the following definition for NullableString:

class NullableString(bloop.String):
    def dynamo_dump(self, value, *, context, **kwargs):
        if value:
            return value
        return None

This should be enough since empty strings are Falsey, and dynamo_load will take care of the rest automatically (empty strings will never be returned from DynamoDB, so the elif branch of your dynamo_load function won't be hit).

trondhindenes commented 6 years ago

good point, I'll adjust my code to match.

numberoverzero commented 6 years ago

Quick update: I've got a few days of time and decided to sort through the issue backlog. It's been long enough since I made the decision to preserve the distinction between Null and "" that I thought it worth revisiting. tl;dr I'm going to make the code sample above the default implementation for String in 2.2, and deal with the fallout in 3.0 should DynamoDB ever support empty strings.


A few points surfaced that I didn't bring up in the last comment from May:

Given the probability that empty strings will be supported in the future, I'm going to incorporate the code sample above into Bloop directly, so that an empty string won't try to be saved to DynamoDB (which would fail) and will instead emit the correct "delete" instruction during UpdateItem. At the same time, any column with the String type that isn't present will load in as "" in the same way that a Set type that has no value loads in as set().

Applying the criteria from the bug fix appendix this change can be made in a minor version, since (1) a user has noticed the unintended behavior (empty string currently throws exceptions from botocore) and (2) fixing the behavior in bloop by no longer passing the empty strings forward will not break a user's reasonable workaround since the only possible workarounds would cause the exceptions to stop (which this fix would also accomplish).

If the comment here comes to pass and DynamoDB does start supporting empty string, I expect enough cascading changes between the implementation and the SDKs which now have to reflect those changes that Bloop can either handle the distinction in a flag, or migrate in 3.0 with a deprecation notice. Reverting the change won't pass criteria (2) for the bug fix appendix, which is why it would have to come with 3.0 and not 2.x.


Reproduction of https://news.ycombinator.com/item?id=13173927:

I used to work on the DynamoDB team. Throwaway account because my normal account can be tied back to my real name.

Each hash key resolves to a number of possible servers the data can be on. Data is replicated across several of these servers. For redundancy. The hash key determines which shard to use.

On individual machines, each set of data is stored by a compound key of hash key and sort key (if there is a sort key). The data is probably stored on disk sequentially by sort key or close to it.

This is pretty much exactly correct. The hash key maps to a quorum group of 3 servers (the key is hashed, with each quorum group owning a range of the hash space). One of those 3 is the master and coordinates writes as well as answering strongly consistent queries; eventually consistent queries can be answered by any of the 3 replicas.

They possibly use something like LevelDB for this.[1]

Sigh...if only. I don't remember the exact timeline but LevelDB either didn't exist when we started development or wasn't stable enough to be on our radar.

DynamoDB is this very elegant system of highly-available replication, millisecond latencies, Paxos distributed state machines, etc. Then at the very bottom of it all there's a big pile of MySQL instances. Plus some ugly JNI/C++ code allowing the Java parts to come in through a side door of the MySQL interface, bypassing most of the query optimizer (since none of the queries are complex) and hitting InnoDB almost directly.

There was a push to implement WiredTiger as an alternative storage engine, and migrate data transparently over time as it proved to be more performant. However, 10gen bought WiredTiger and their incentive to work with us vanished, as MongoDB was and is one of Dynamo's biggest competitors.

numberoverzero commented 6 years ago

Hey @trondhindenes,

With the release of 2.2 (#123) I believe this can be closed.

String and Binary will use empty values for None, so you should be able to remove the NullableString type above and everything will just work. Here are the new and updated docs:

If you think anything's missing or could be improved feel free to comment or re-open. Thanks for your feedback!

trondhindenes commented 6 years ago

Thank you again for your detailed responses and continous updates @numberoverzero - I'm learning lots by reading your replies :-). The docs look great, I'm fine with closing this. Thank you again for this!