Closed alexrwix closed 5 years ago
Looking at this. Thanks.
Hi @alexrwix ,
I'm trying to understand how or why this would come up.
In an interactive console, I've used the py27 version of NDB to create and store an entity:
from google.appengine.ext import ndb
class SubKind(ndb.Model):
one = ndb.StringProperty()
two = ndb.StringProperty()
three = ndb.StringProperty()
class MyKind(ndb.Model):
foo = ndb.StructuredProperty(SubKind, repeated=True)
entity = MyKind(foo=[
SubKind(one="Hamlet", two="Othello"),
SubKind(three="Julius Caesar"),
])
entity.put()
The entity it creates in Datastore is:
| Name/ID | foo.one | foo.three | foo.two |
| id=5632891572715520 | ["Hamlet",null] | [null,"Julius Caesar"] | ["Othello",null] |
As you can see, missing values for subproperties are filled in with null
which is what I would intuitively expect. If it were to actually store lists of uneven length, it would be indeterminate which property values are supposed to go in which subentity. For example, let's say it had actually elided the null
values and stored this:
| Name/ID | foo.one | foo.three | foo.two |
| id=5632891572715520 | ["Hamlet"] | ["Julius Caesar"] | ["Othello"] |
That would get interpreted as having a single subentity where one
, two
, and three
are "Hamlet"
, "Julius Caesar"
, and "Othello"
respectively, which is not what we were trying to store.
Hewing more closely to your example:
entity = MyKind(foo=[
SubKind(one="first", two="second", three="third"),
SubKind(one="first-1", two="second-1", three="third-1"),
SubKind(three="third-2"),
])
entity.put()
We get:
| Name/ID | foo.one | foo.three | foo.two |
| id=5718229955641344 | ["first","first-1",null] | ["third","third-1","third-2"] | ["second","second-1",null] |
Because the nulls are included, this should work just fine with the code as-is.
Consider this variant:
entity = MyKind(foo=[
SubKind(one="first", two="second", three="third"),
SubKind(three="third-2"),
SubKind(one="first-1", two="second-1", three="third-1"),\
])
entity.put()
Which gets us:
| Name/ID | foo.one | foo.three | foo.two |
| id=5767558594560000 | ["first",null,"first-1"] | ["third","third-2","third-1"] | ["second",null,"second-1"] |
Here, if we were to elide the nulls such that foo.one
and foo.two
each only had two values, then we'd get back the wrong subentities when we tried to decode.
Is there a real-world use case I'm missing where NDB would be writing the subentities without null
spacers? If so, how does that happen, and what's the argument for assuming null
paddings at the end of the shorter sequences when it seems like they could go anywhere in the sequence?
Hi Chris. Yes of course. I'm checking python-ndb with real production content.
Our repeated structured property models inherited from ndb.Expando so the same models sometimes have a different set of properties.
Below is an example of real datastore entity
output property is:
class FileOutput(ndb.Expando):
items = ndb.StructuredProperty(FileOutputItem, repeated=True)
class FileOutputItem(ndb.Expando):
key = ndb.StringProperty()
type = ndb.StringProperty()
tag = ndb.StringProperty()
status = ndb.StringProperty()
url = ndb.StringProperty()
secure = ndb.BooleanProperty(default=False)
Here is an entitiy example loaded directly from datastore:
<Entity('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFile',
'19fc1a_3bc5448b27144c16bc6a5bf6270ba580'){
'output.items.quality': [
'480p',
None,
'144p',
'720p',
'1080p',
'720p',
'1080p',
'480p'
],
'date_created': datetime.datetime(2019,
6,
27,
9,
39,
54,
169515,
tzinfo=<UTC>),
'user_id': '19fc1a9e-4bdd-4018-a6b8-88499c81e7b5',
'output.items.url': [
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/480p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/sprite/manifest.json',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/144p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/720p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/1080p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/720p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/1080p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/480p/mp4/file.mp4',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f000.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f001.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f003.jpg'
],
'date_updated': datetime.datetime(2019,
6,
27,
9,
39,
54,
169397,
tzinfo=<UTC>),
'hash': '7c353c043ce1ced5ac12dea9f52a4c19',
'output.items.kind': [
None,
'manifest',
'video',
None,
None,
None,
None,
None,
None,
None,
None,
None
],
'output.items.type': [
'video',
'storyboard',
'storyboard',
'video',
'video',
'clip',
'clip',
'clip',
'image',
'image',
'image',
'image'
],
'input.video_bitrate': 7996143,
'output.items.fps': [
25,
None,
25.0,
25,
25,
25,
25,
25
],
'app_id': None,
'height': 1080,
'file_size': 6806321,
'icon_url': 'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
'input.display_aspect_ratio': '16:9',
'output.items.tag': [
'High',
None,
'Low',
'HD',
'HD',
'HD',
'HD',
'High',
None,
None,
None,
None
],
'output.items.height': [
480,
None,
144,
720,
1080,
720,
1080,
480,
1080,
1080,
1080,
1080
],
'file_name': '19fc1a_3bc5448b27144c16bc6a5bf6270ba580',
'input.duration': 6800,
'input.audio_bitrate': -1,
'output.items.status': [
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY'
],
'output.items.key': [
'480p.mp4',
'storyboard-sprite.json',
'storyboard-144p.mp4',
'720p.mp4',
'1080p.mp4',
'clip-720p.mp4',
'clip-1080p.mp4',
'clip-480p.mp4',
'poster.0',
'poster.1',
'poster.2',
'poster.3'
],
'mime_type': 'video/mp4',
'width': 1920,
'output.items.video_bitrate': [
1248686,
None,
100000,
3080326,
5991112,
2994678,
5911071,
1246376
],
'input.type': 'video',
'input.sample_aspect_ratio': '1:1',
'site_id': '5d3777e1-3c08-460b-9a30-f8ac858713c9',
'file_url': 'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/file',
'op_status': 'READY',
'output.items.duration': [
6760,
None,
6800,
6760,
6760,
6760,
6760,
6760,
None,
None,
None,
None
],
'state': '',
'input.fps': 25.0,
'output.items.format': [
'mp4',
'json',
'mp4',
'mp4',
'mp4',
'mp4',
'mp4',
'mp4',
'jpg',
'jpg',
'jpg',
'jpg'
],
'output.items.secure': [
False,
False,
False,
False,
False,
False,
False,
False,
False,
False,
False,
False
],
'output.items.audio_bitrate': [
-1,
None,
-1,
-1,
-1,
-1,
-1,
-1,
None,
None,
None,
None
],
'parent_folder_key': <Key('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFolder',
'media-root'),
project=wixprivatemedia-gptest>,
'tags': [
'_fileOrigin_uploaded'
],
'original_file_name': 'BGbreaking.mp4',
'output.items.display_aspect_ratio': [
'427:240',
None,
'16:9',
'16:9',
'16:9',
'16:9',
'16:9',
'427:240'
],
'output.items.width': [
854,
None,
256,
1280,
1920,
1280,
1920,
854,
1920,
1920,
1920,
1920
],
'input.height': 1080,
'ns': None,
'media_type': 'video',
'store': None,
'input.width': 1920,
'input.rotation': 0
}>
After loading with py37 python-ndb, output items array has 8 items instead of with 12
But the biggest problem is that this issue is INCONSISTENT! It depends on properties order in the entity that returned from datastore!
In case of this properties order, all items loaded correctly
<Entity('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFile',
'19fc1a_3bc5448b27144c16bc6a5bf6270ba580'){
'parent_folder_key': <Key('Site',
'5d3777e1-3c08-460b-9a30-f8ac858713c9',
'SiteFolder',
'media-root'),
project=wixprivatemedia-gptest>,
'tags': [
'_fileOrigin_uploaded'
],
'original_file_name': 'BGbreaking.mp4',
'output.items.display_aspect_ratio': [
'427:240',
None,
'16:9',
'16:9',
'16:9',
'16:9',
'16:9',
'427:240'
],
'output.items.width': [
854,
None,
256,
1280,
1920,
1280,
1920,
854,
1920,
1920,
1920,
1920
],
'input.height': 1080,
'ns': None,
'media_type': 'video',
'store': None,
'input.width': 1920,
'input.rotation': 0,
'output.items.quality': [
'480p',
None,
'144p',
'720p',
'1080p',
'720p',
'1080p',
'480p'
],
'date_created': datetime.datetime(2019,
6,
27,
9,
39,
54,
169515,
tzinfo=<UTC>),
'user_id': '19fc1a9e-4bdd-4018-a6b8-88499c81e7b5',
'output.items.url': [
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/480p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/sprite/manifest.json',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/storyboard/144p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/720p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/1080p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/720p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/1080p/mp4/file.mp4',
'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/clip/480p/mp4/file.mp4',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f000.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f001.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f003.jpg'
],
'date_updated': datetime.datetime(2019,
6,
27,
9,
39,
54,
169397,
tzinfo=<UTC>),
'hash': '7c353c043ce1ced5ac12dea9f52a4c19',
'output.items.kind': [
None,
'manifest',
'video',
None,
None,
None,
None,
None,
None,
None,
None,
None
],
'output.items.type': [
'video',
'storyboard',
'storyboard',
'video',
'video',
'clip',
'clip',
'clip',
'image',
'image',
'image',
'image'
],
'input.video_bitrate': 7996143,
'output.items.fps': [
25,
None,
25.0,
25,
25,
25,
25,
25
],
'app_id': None,
'height': 1080,
'file_size': 6806321,
'icon_url': 'media/19fc1a_3bc5448b27144c16bc6a5bf6270ba580f002.jpg',
'input.display_aspect_ratio': '16:9',
'output.items.tag': [
'High',
None,
'Low',
'HD',
'HD',
'HD',
'HD',
'High',
None,
None,
None,
None
],
'file_name': '19fc1a_3bc5448b27144c16bc6a5bf6270ba580',
'output.items.height': [
480,
None,
144,
720,
1080,
720,
1080,
480,
1080,
1080,
1080,
1080
],
'input.duration': 6800,
'input.audio_bitrate': -1,
'output.items.status': [
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY',
'READY'
],
'output.items.key': [
'480p.mp4',
'storyboard-sprite.json',
'storyboard-144p.mp4',
'720p.mp4',
'1080p.mp4',
'clip-720p.mp4',
'clip-1080p.mp4',
'clip-480p.mp4',
'poster.0',
'poster.1',
'poster.2',
'poster.3'
],
'mime_type': 'video/mp4',
'width': 1920,
'input.type': 'video',
'output.items.video_bitrate': [
1248686,
None,
100000,
3080326,
5991112,
2994678,
5911071,
1246376
],
'input.sample_aspect_ratio': '1:1',
'site_id': '5d3777e1-3c08-460b-9a30-f8ac858713c9',
'op_status': 'READY',
'file_url': 'video/19fc1a_3bc5448b27144c16bc6a5bf6270ba580/file',
'state': '',
'output.items.duration': [
6760,
None,
6800,
6760,
6760,
6760,
6760,
6760,
None,
None,
None,
None
],
'input.fps': 25.0,
'output.items.format': [
'mp4',
'json',
'mp4',
'mp4',
'mp4',
'mp4',
'mp4',
'mp4',
'jpg',
'jpg',
'jpg',
'jpg'
],
'output.items.secure': [
False,
False,
False,
False,
False,
False,
False,
False,
False,
False,
False,
False
],
'output.items.audio_bitrate': [
-1,
None,
-1,
-1,
-1,
-1,
-1,
-1,
None,
None,
None,
None
]
}
So, I've tweaked my py2.7 test script to use Expando
and leave three
undeclared, to be handled by the Expando
class:
from google.appengine.ext import ndb
class SubKind(ndb.Expando):
one = ndb.StringProperty()
two = ndb.StringProperty()
class MyKind(ndb.Expando):
foo = ndb.StructuredProperty(SubKind, repeated=True)
entity = MyKind(foo=[
SubKind(one="Hamlet", two="Othello"),
SubKind(three="Julius Caesar"),
])
entity.put()
Doing this, I do get an entity with uneven repeated property lengths, like you describe:
| Name/ID | foo.one | foo.three | foo.two |
| id=5644755899777024 | ["Hamlet",null] | ["Julius Caesar"] | ["Othello",null] |
The part that I'm still having trouble with is the fact that the subentities, in effect, are getting scrambled. If we just naively pad the end of the shorter property with nulls, we'll get the following subentities:
SubKind(one="Hamlet", two="Othello", three="Julius Caeser")
SubKind() # all properties are None
This, obviously, isn't what we tried to store and as far as I can tell is a bug, not a feature to be replicated. Unless I am missing something, the sub-entities in your production database are going to be somewhat mixed up with properties that were saved as being part of one sub-entity, being assigned to a different sub-entity upon retrieval.
Do you see what the problem is here? Is there something I'm missing?
Hi Chris. Yes. This specific example looks very problematic. It really looks like py27 ndb bug. In the current py37 version, we don't have such problem because it's saved as an embedded property. Am I right?
I have never tried such scenario. In our case, all "SubKind" constant properties always have values.
What is your suggestion? How loading problem can be resolved? We have hundreds of millions of records with such properties.
Hi Chris. Yes. This specific example looks very problematic. It really looks like py27 ndb bug. In the current py37 version, we don't have such problem because it's saved as an embedded property. Am I right?
Correct.
I have never tried such scenario.
As far as I can tell, your example is such a scenario.
In our case, all "SubKind" constant properties always have values.
In your example, quality
has 8 values but url
has 12 values. Therefore, there are 4 items that don't have quality
values. The issue is that we have no way of knowing which 4 items. We can no longer trust that quality
values match up to the correct items upon marshalling from Datastore. We don't have enough data to figure that out. This is what I mean by your item data being "scrambled".
What is your suggestion? How loading problem can be resolved? We have hundreds of millions of records with such properties.
That is a good question. I don't know yet. We probably need to have some internal discussion. I believe I understand the problem, but want to discuss with my team what the best approach for a solution might be.
Thank you. Looking forward to your solution.
There is an additional issue with py27 ndb models with dotted property names. Related to https://github.com/googleapis/python-ndb/pull/126
In the case of a structured property which has repeated properties of different lengths, only the first two items are loaded.
For example:
Current result (py37 python-ndb):
TestRepeatedStructuredModel(key=Key('TestRepeatedStructuredModel', '123456'), int_prop=42, struct_prop=StructuredRepeatedPropModel(key=Key('StructuredRepeatedPropModel', None), items=[StructuredPropModel(key=Key('StructuredPropModel', None), one='first', three='third', two='second'), StructuredPropModel(key=Key('StructuredPropModel', None), one='first-1', three='third-1', two='second-1')]))
Expected result (py27 ndb):
TestRepeatedStructuredModel(key=Key('TestRepeatedStructuredModel', '123456'), int_prop=42, struct_prop=StructuredRepeatedPropModel(key=Key('StructuredRepeatedPropModel', None), items=[StructuredPropModel(key=Key('StructuredPropModel', None), one='first', three='third', two='second'), StructuredPropModel(key=Key('StructuredPropModel', None), one='first-1', three='third-1', two='second-1'), StructuredPropModel(key=Key('StructuredPropModel', None), one=None, three='third-2', two=None)]))
All working as expected, if all dotted repeated properties have the same length
Entity Protobuf:
Test example: