googleapis / python-ndb

Apache License 2.0
150 stars 66 forks source link

older style structured properties issue #129

Closed alexrwix closed 5 years ago

alexrwix commented 5 years ago

There is an additional issue with py27 ndb models with dotted property names. Related to https://github.com/googleapis/python-ndb/pull/126

In the case of a structured property which has repeated properties of different lengths, only the first two items are loaded.

For example:

{
        "int_prop": 42,
        "struct_prop.items.one": ["first", "first-1"],
        "struct_prop.items.two": ["second", "second-1"],
        "struct_prop.items.three": ["third", "third-1", "third-2"]
}

Current result (py37 python-ndb): TestRepeatedStructuredModel(key=Key('TestRepeatedStructuredModel', '123456'), int_prop=42, struct_prop=StructuredRepeatedPropModel(key=Key('StructuredRepeatedPropModel', None), items=[StructuredPropModel(key=Key('StructuredPropModel', None), one='first', three='third', two='second'), StructuredPropModel(key=Key('StructuredPropModel', None), one='first-1', three='third-1', two='second-1')]))

Expected result (py27 ndb): TestRepeatedStructuredModel(key=Key('TestRepeatedStructuredModel', '123456'), int_prop=42, struct_prop=StructuredRepeatedPropModel(key=Key('StructuredRepeatedPropModel', None), items=[StructuredPropModel(key=Key('StructuredPropModel', None), one='first', three='third', two='second'), StructuredPropModel(key=Key('StructuredPropModel', None), one='first-1', three='third-1', two='second-1'), StructuredPropModel(key=Key('StructuredPropModel', None), one=None, three='third-2', two=None)]))

All working as expected, if all dotted repeated properties have the same length

{
        "int_prop": 42,
        "struct_prop.items.one": ["first", "first-1"],
        "struct_prop.items.two": ["second", "second-1"],
        "struct_prop.items.three": ["third", "third-1"]
}

Entity Protobuf:

{
  partition_id {
    project_id: "kuk"
  }
  path {
    kind: "TestRepeatedStructuredModel"
    name: "123456"
  }
}
properties {
  key: "int_prop"
  value {
    integer_value: 42
  }
}
properties {
  key: "struct_prop.items.one"
  value {
    array_value {
      values {
        string_value: "first"
      }
      values {
        string_value: "first-1"
      }
    }
  }
}
properties {
  key: "struct_prop.items.three"
  value {
    array_value {
      values {
        string_value: "third"
      }
      values {
        string_value: "third-1"
      }
      values {
        string_value: "third-2"
      }
    }
  }
}
properties {
  key: "struct_prop.items.two"
  value {
    array_value {
      values {
        string_value: "second"
      }
      values {
        string_value: "second-1"
      }
    }
  }
}

Test example:

@ndb_client_context()
    def test_entity_with_dotted_repeated_structured_property_different_length(self):
        db_driver = DbDriver(kind="TestRepeatedStructuredModel")

        ds_key = '123456'
        data = {
            "int_prop": 42,
            "struct_prop.items.one": ["first", "first-1"],
            "struct_prop.items.two": ["second", "second-1"],
            "struct_prop.items.three": ["third", "third-1", "third-2"]
        }

        db_driver.put(ds_key, data)

        key = ndb.Key(TestRepeatedStructuredModel, ds_key)

        retrieved = key.get()
        print(f'retrieved: {retrieved}')

        assert retrieved.int_prop == 42
        assert len(retrieved.struct_prop.items) == 3   // Error: Returning 2 items

        assert key.delete() is None
        assert key.get() is None

        db_driver.clear()
chrisrossi commented 5 years ago

Looking at this. Thanks.

chrisrossi commented 5 years ago

Hi @alexrwix ,

I'm trying to understand how or why this would come up.

In an interactive console, I've used the py27 version of NDB to create and store an entity:

from google.appengine.ext import ndb

class SubKind(ndb.Model):
    one = ndb.StringProperty()
    two = ndb.StringProperty()
    three = ndb.StringProperty()

class MyKind(ndb.Model):
    foo = ndb.StructuredProperty(SubKind, repeated=True)

entity = MyKind(foo=[
    SubKind(one="Hamlet", two="Othello"),
    SubKind(three="Julius Caesar"),
])

entity.put()

The entity it creates in Datastore is:

  | Name/ID | foo.one | foo.three | foo.two |  
  | id=5632891572715520 | ["Hamlet",null] | [null,"Julius Caesar"] | ["Othello",null] |  

As you can see, missing values for subproperties are filled in with null which is what I would intuitively expect. If it were to actually store lists of uneven length, it would be indeterminate which property values are supposed to go in which subentity. For example, let's say it had actually elided the null values and stored this:

  | Name/ID | foo.one | foo.three | foo.two |  
  | id=5632891572715520 | ["Hamlet"] | ["Julius Caesar"] | ["Othello"] |  

That would get interpreted as having a single subentity where one, two, and three are "Hamlet", "Julius Caesar", and "Othello" respectively, which is not what we were trying to store.

Hewing more closely to your example:

entity = MyKind(foo=[
    SubKind(one="first", two="second", three="third"),
    SubKind(one="first-1", two="second-1", three="third-1"),
    SubKind(three="third-2"),
])

entity.put()

We get:

  | Name/ID | foo.one | foo.three | foo.two |  
  | id=5718229955641344 | ["first","first-1",null] | ["third","third-1","third-2"] | ["second","second-1",null] |  

Because the nulls are included, this should work just fine with the code as-is.

Consider this variant:

entity = MyKind(foo=[
    SubKind(one="first", two="second", three="third"),
    SubKind(three="third-2"),
    SubKind(one="first-1", two="second-1", three="third-1"),\
])

entity.put()

Which gets us:

  | Name/ID | foo.one | foo.three | foo.two |  
  | id=5767558594560000 | ["first",null,"first-1"] | ["third","third-2","third-1"] | ["second",null,"second-1"] |  

Here, if we were to elide the nulls such that foo.one and foo.two each only had two values, then we'd get back the wrong subentities when we tried to decode.

Is there a real-world use case I'm missing where NDB would be writing the subentities without null spacers? If so, how does that happen, and what's the argument for assuming null paddings at the end of the shorter sequences when it seems like they could go anywhere in the sequence?

alexrwix commented 5 years ago

Hi Chris. Yes of course. I'm checking python-ndb with real production content.

Our repeated structured property models inherited from ndb.Expando so the same models sometimes have a different set of properties.

Below is an example of real datastore entity

output property is:

class FileOutput(ndb.Expando):
    items        = ndb.StructuredProperty(FileOutputItem, repeated=True)

class FileOutputItem(ndb.Expando):
    key          = ndb.StringProperty()
    type         = ndb.StringProperty()
    tag          = ndb.StringProperty()
    status       = ndb.StringProperty()
    url          = ndb.StringProperty()
    secure       = ndb.BooleanProperty(default=False)

Here is an entitiy example loaded directly from datastore:

<Entity('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFile',
'19fc1a_3bc5448b27144c16bc6a5bf6270ba580'){
  'output.items.quality': [
    '480p',
    None,
    '144p',
    '720p',
    '1080p',
    '720p',
    '1080p',
    '480p'
  ],
  'date_created': datetime.datetime(2019,
  6,
  27,
  9,
  39,
  54,
  169515,
  tzinfo=<UTC>),
  'user_id': '19fc1a9e-4bdd-4018-a6b8-88499c81e7b5',
  'output.items.url': [
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/480p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/sprite/manifest.json',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/144p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/720p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/1080p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/720p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/1080p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/480p/mp4/file.mp4',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f000.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f001.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f003.jpg'
  ],
  'date_updated': datetime.datetime(2019,
  6,
  27,
  9,
  39,
  54,
  169397,
  tzinfo=<UTC>),
  'hash': '7c353c043ce1ced5ac12dea9f52a4c19',
  'output.items.kind': [
    None,
    'manifest',
    'video',
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None
  ],
  'output.items.type': [
    'video',
    'storyboard',
    'storyboard',
    'video',
    'video',
    'clip',
    'clip',
    'clip',
    'image',
    'image',
    'image',
    'image'
  ],
  'input.video_bitrate': 7996143,
  'output.items.fps': [
    25,
    None,
    25.0,
    25,
    25,
    25,
    25,
    25
  ],
  'app_id': None,
  'height': 1080,
  'file_size': 6806321,
  'icon_url': 'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
  'input.display_aspect_ratio': '16:9',
  'output.items.tag': [
    'High',
    None,
    'Low',
    'HD',
    'HD',
    'HD',
    'HD',
    'High',
    None,
    None,
    None,
    None
  ],
  'output.items.height': [
    480,
    None,
    144,
    720,
    1080,
    720,
    1080,
    480,
    1080,
    1080,
    1080,
    1080
  ],
  'file_name': '19fc1a_3bc5448b27144c16bc6a5bf6270ba580',
  'input.duration': 6800,
  'input.audio_bitrate': -1,
  'output.items.status': [
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY'
  ],
  'output.items.key': [
    '480p.mp4',
    'storyboard-sprite.json',
    'storyboard-144p.mp4',
    '720p.mp4',
    '1080p.mp4',
    'clip-720p.mp4',
    'clip-1080p.mp4',
    'clip-480p.mp4',
    'poster.0',
    'poster.1',
    'poster.2',
    'poster.3'
  ],
  'mime_type': 'video/mp4',
  'width': 1920,
  'output.items.video_bitrate': [
    1248686,
    None,
    100000,
    3080326,
    5991112,
    2994678,
    5911071,
    1246376
  ],
  'input.type': 'video',
  'input.sample_aspect_ratio': '1:1',
  'site_id': '5d3777e1-3c08-460b-9a30-f8ac858713c9',
  'file_url': 'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/file',
  'op_status': 'READY',
  'output.items.duration': [
    6760,
    None,
    6800,
    6760,
    6760,
    6760,
    6760,
    6760,
    None,
    None,
    None,
    None
  ],
  'state': '',
  'input.fps': 25.0,
  'output.items.format': [
    'mp4',
    'json',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'jpg',
    'jpg',
    'jpg',
    'jpg'
  ],
  'output.items.secure': [
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False
  ],
  'output.items.audio_bitrate': [
    -1,
    None,
    -1,
    -1,
    -1,
    -1,
    -1,
    -1,
    None,
    None,
    None,
    None
  ],
  'parent_folder_key': <Key('Site',
  '5d3777e1-3c08-460b-9a30-f8ac858713c9',
  'SiteFolder',
  'media-root'),
  project=wixprivatemedia-gptest>,
  'tags': [
    '_fileOrigin_uploaded'
  ],
  'original_file_name': 'BGbreaking.mp4',
  'output.items.display_aspect_ratio': [
    '427:240',
    None,
    '16:9',
    '16:9',
    '16:9',
    '16:9',
    '16:9',
    '427:240'
  ],
  'output.items.width': [
    854,
    None,
    256,
    1280,
    1920,
    1280,
    1920,
    854,
    1920,
    1920,
    1920,
    1920
  ],
  'input.height': 1080,
  'ns': None,
  'media_type': 'video',
  'store': None,
  'input.width': 1920,
  'input.rotation': 0
}>

After loading with py37 python-ndb, output items array has 8 items instead of with 12

But the biggest problem is that this issue is INCONSISTENT! It depends on properties order in the entity that returned from datastore!

In case of this properties order, all items loaded correctly

<Entity('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFile',
'19fc1a_3bc5448b27144c16bc6a5bf6270ba580'){
  'parent_folder_key': <Key('Site',
  '5d3777e1-3c08-460b-9a30-f8ac858713c9',
  'SiteFolder',
  'media-root'),
  project=wixprivatemedia-gptest>,
  'tags': [
    '_fileOrigin_uploaded'
  ],
  'original_file_name': 'BGbreaking.mp4',
  'output.items.display_aspect_ratio': [
    '427:240',
    None,
    '16:9',
    '16:9',
    '16:9',
    '16:9',
    '16:9',
    '427:240'
  ],
  'output.items.width': [
    854,
    None,
    256,
    1280,
    1920,
    1280,
    1920,
    854,
    1920,
    1920,
    1920,
    1920
  ],
  'input.height': 1080,
  'ns': None,
  'media_type': 'video',
  'store': None,
  'input.width': 1920,
  'input.rotation': 0,
  'output.items.quality': [
    '480p',
    None,
    '144p',
    '720p',
    '1080p',
    '720p',
    '1080p',
    '480p'
  ],
  'date_created': datetime.datetime(2019,
  6,
  27,
  9,
  39,
  54,
  169515,
  tzinfo=<UTC>),
  'user_id': '19fc1a9e-4bdd-4018-a6b8-88499c81e7b5',
  'output.items.url': [
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/480p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/sprite/manifest.json',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/144p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/720p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/1080p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/720p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/1080p/mp4/file.mp4',
    'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/480p/mp4/file.mp4',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f000.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f001.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
    'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f003.jpg'
  ],
  'date_updated': datetime.datetime(2019,
  6,
  27,
  9,
  39,
  54,
  169397,
  tzinfo=<UTC>),
  'hash': '7c353c043ce1ced5ac12dea9f52a4c19',
  'output.items.kind': [
    None,
    'manifest',
    'video',
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None
  ],
  'output.items.type': [
    'video',
    'storyboard',
    'storyboard',
    'video',
    'video',
    'clip',
    'clip',
    'clip',
    'image',
    'image',
    'image',
    'image'
  ],
  'input.video_bitrate': 7996143,
  'output.items.fps': [
    25,
    None,
    25.0,
    25,
    25,
    25,
    25,
    25
  ],
  'app_id': None,
  'height': 1080,
  'file_size': 6806321,
  'icon_url': 'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
  'input.display_aspect_ratio': '16:9',
  'output.items.tag': [
    'High',
    None,
    'Low',
    'HD',
    'HD',
    'HD',
    'HD',
    'High',
    None,
    None,
    None,
    None
  ],
  'file_name': '19fc1a_3bc5448b27144c16bc6a5bf6270ba580',
  'output.items.height': [
    480,
    None,
    144,
    720,
    1080,
    720,
    1080,
    480,
    1080,
    1080,
    1080,
    1080
  ],
  'input.duration': 6800,
  'input.audio_bitrate': -1,
  'output.items.status': [
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY',
    'READY'
  ],
  'output.items.key': [
    '480p.mp4',
    'storyboard-sprite.json',
    'storyboard-144p.mp4',
    '720p.mp4',
    '1080p.mp4',
    'clip-720p.mp4',
    'clip-1080p.mp4',
    'clip-480p.mp4',
    'poster.0',
    'poster.1',
    'poster.2',
    'poster.3'
  ],
  'mime_type': 'video/mp4',
  'width': 1920,
  'input.type': 'video',
  'output.items.video_bitrate': [
    1248686,
    None,
    100000,
    3080326,
    5991112,
    2994678,
    5911071,
    1246376
  ],
  'input.sample_aspect_ratio': '1:1',
  'site_id': '5d3777e1-3c08-460b-9a30-f8ac858713c9',
  'op_status': 'READY',
  'file_url': 'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/file',
  'state': '',
  'output.items.duration': [
    6760,
    None,
    6800,
    6760,
    6760,
    6760,
    6760,
    6760,
    None,
    None,
    None,
    None
  ],
  'input.fps': 25.0,
  'output.items.format': [
    'mp4',
    'json',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'mp4',
    'jpg',
    'jpg',
    'jpg',
    'jpg'
  ],
  'output.items.secure': [
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    False
  ],
  'output.items.audio_bitrate': [
    -1,
    None,
    -1,
    -1,
    -1,
    -1,
    -1,
    -1,
    None,
    None,
    None,
    None
  ]
}
chrisrossi commented 5 years ago

So, I've tweaked my py2.7 test script to use Expando and leave three undeclared, to be handled by the Expando class:

from google.appengine.ext import ndb

class SubKind(ndb.Expando):
    one = ndb.StringProperty()
    two = ndb.StringProperty()

class MyKind(ndb.Expando):
    foo = ndb.StructuredProperty(SubKind, repeated=True)

entity = MyKind(foo=[
    SubKind(one="Hamlet", two="Othello"),
    SubKind(three="Julius Caesar"),
])

entity.put()

Doing this, I do get an entity with uneven repeated property lengths, like you describe:

  | Name/ID | foo.one | foo.three | foo.two |  
  | id=5644755899777024 | ["Hamlet",null] | ["Julius Caesar"] | ["Othello",null] |  

The part that I'm still having trouble with is the fact that the subentities, in effect, are getting scrambled. If we just naively pad the end of the shorter property with nulls, we'll get the following subentities:

SubKind(one="Hamlet", two="Othello", three="Julius Caeser")
SubKind()  # all properties are None

This, obviously, isn't what we tried to store and as far as I can tell is a bug, not a feature to be replicated. Unless I am missing something, the sub-entities in your production database are going to be somewhat mixed up with properties that were saved as being part of one sub-entity, being assigned to a different sub-entity upon retrieval.

Do you see what the problem is here? Is there something I'm missing?

alexrwix commented 5 years ago

Hi Chris. Yes. This specific example looks very problematic. It really looks like py27 ndb bug. In the current py37 version, we don't have such problem because it's saved as an embedded property. Am I right?

I have never tried such scenario. In our case, all "SubKind" constant properties always have values.

What is your suggestion? How loading problem can be resolved? We have hundreds of millions of records with such properties.

chrisrossi commented 5 years ago

Hi Chris. Yes. This specific example looks very problematic. It really looks like py27 ndb bug. In the current py37 version, we don't have such problem because it's saved as an embedded property. Am I right?

Correct.

I have never tried such scenario.

As far as I can tell, your example is such a scenario.

In our case, all "SubKind" constant properties always have values.

In your example, quality has 8 values but url has 12 values. Therefore, there are 4 items that don't have quality values. The issue is that we have no way of knowing which 4 items. We can no longer trust that quality values match up to the correct items upon marshalling from Datastore. We don't have enough data to figure that out. This is what I mean by your item data being "scrambled".

What is your suggestion? How loading problem can be resolved? We have hundreds of millions of records with such properties.

That is a good question. I don't know yet. We probably need to have some internal discussion. I believe I understand the problem, but want to discuss with my team what the best approach for a solution might be.

alexrwix commented 5 years ago

Thank you. Looking forward to your solution.