Open mapsam opened 8 years ago
My understanding is that array and object values are not part of the specification, and if you want to use them you need to invent your own serialization protocol and ensure that the encoders and decoders in your part of the ecosystem understand it. In other words, there is no "should" here at the level of the VT spec.
Thanks @jfirebaugh - this is the case.
Mapbox Studio Classic and recent versions of Tippecanoe write them as stringified JSON, with no way to tell whether the value originally came from a string or an object.
with no way to tell whether the value originally came from a string or an object
Interesting @ericfischer. Does that mean parsers must try to parse as JSON first, then assume it's a string if the JSON cannot be parsed?
@mapsam I think most parsers just treat them as strings and don't make any attempt to reconstitute objects. But @ingalls may have more details on what ultimately happens with the values that prompted https://github.com/mapbox/tippecanoe/issues/155.
Ah, so it's up the user then to know their string/object-string values. Makes sense!
Here is one place this conversation is happening in mapbox-gl-js.
@mapsam our use case is such that we expect either an array or a stringified array given the property name so for us parsing is relatively easy.
@mapsam mentioned reopening this to have a convo about adding how to handle objects and arrays into the spec.
@flippmoke what are your thoughts on adding this in?
+1 to adding it in - multiple people are adding ways to represent lists (e.g. tilezen/mapbox-vector-tile#64)
I have been talking to a large number of people about this at SOTM 2016. The biggest hurdle to store the unlimited key-values (object/dictionary/hstore) and the array support is that Mapnik is fairly rigid in terms of non-simple-type columns, and the number of those columns, so adding this support to Mapnik would require substantial efforts. I would like to propose both the Mapnik workaround as well as the spec change.
To generate an object or an array in vtile using Mapnik, one could generate the data as a single JSON-encoded string value, and add a custom server-side vector post-processing step that decodes JSON and converts it into values. This way, "name": {"en":"London","hu":"London","fr":"Londres"}
would be stored as a long string, with multiple identical strings repeating, and the post-processor would convert it to 3 individual tags - "name:en":"London", "name:hu":"London", "name:fr":"Londres"
(thus deduping the values).
There are two ways to actually store it:
collate()
function needs to be implemented to handle multiple languages - collate(tags,"name:fr","name:hu","name:en")
would get the first existing value.The Value message should have two additional fields:
// A list of key-value pairs. Just like Feature's "tags", the repeated pairs
// of integers point to Layer's "keys" (1st integer) and "values" (2nd integer).
repeated uint32 object_value = 8 [ packed = true ];
// An array of values. Similar to Feature's "tags", each integer represents
// an index of the Layer's "values" array.
repeated uint32 list_value = 9 [ packed = true ];
So if the value contains object_value
, the odd integers are the indexes of the Layer's keys array, and the even indexes refer to the Layer's values array. Arrays are implemented the same way, only this time list_value indexes only refer to the Layer's values array.
One of the drawback is that it is now possible to have circular dependencies, so to avoid this and simplify client's data validation, data may only be two levels deep (root+one sublevel). Another unenforced but possibly valuable requirement might be to restrict all values in both object and list to have the same type.
Sorry for my lack of responses on this ticket: Let me summarize some thoughts I have on this currently.
I would like to propose both the Mapnik workaround as well as the spec change.
Eventually we do want to pull out all Mapbox vector tile encoding and decoding such that it is not required to be a part of Mapnik. In general mapnik-vector-tile issues should probably be discussed there. Here on the spec trying to specifically just cover ideas specific to the specification.
List support could be added by changing section 4.4 - which currently reads as follows:
Feature attributes are encoded as pairs of integers in the
tag
field of a feature. The first integer in each pair represents the zero-based index of the key in thekeys
set of thelayer
to which the feature belongs. The second integer in each pair represents the zero-based index of the value in thevalues
set of thelayer
to which the feature belongs. Every key index MUST be unique within that feature such that no other attribute pair within that feature has the same key index. A feature MUST have an even number oftag
fields. A featuretag
field MUST NOT contain a key index or value index greater than or equal to the number of elements in the layer'skeys
orvalues
set, respectively.
In theory we could allow keys not to be unique per a feature and if more then one occurs we assume the item to be a list. This is not ideal for compression necessarily but could solve the problem.
I do not think this change could qualify as anything but a major version incrementation (though we could debate on this).
Objects scare me quite a bit more because we suddenly are adding depth "values". This could be something that was very bad for performance overall if not implemented with thought and with some testing.
Example Implementation:
message Object {
// Same as feature tags currently
repeated uint32 tags = 2 [ packed = true ];
}
message Value {
// Exactly one of these values must be present in a valid message
optional string string_value = 1;
optional float float_value = 2;
optional double double_value = 3;
optional int64 int_value = 4;
optional uint64 uint_value = 5;
optional sint64 sint_value = 6;
optional bool bool_value = 7;
optional Object object_value = 8;
extensions 9 to max;
}
I am going to leave this ticket open for continued discussion. Thanks for the feedback everyone!
@nyurik part of the challenge here with both of our ideas is that that means that all non object values MUST be encoded first. This would put a variety of different requirements on both the encoders and decoders.
@flippmoke could you elaborate on your last comment? I'm not sure I understood.
@nyurik
Lets just assume the following structure:
{
"key1": value1(int)
"key2": value2(object) {
"key3": value3(object) {
"key4": value4(int),
"key5": value5(int)
},
"key6": value6(int)
},
"key7": value7(int)
}
Because Value
now has an "Object" that must know the index of keys and values. The order of encoding becomes critical.
value4
and value5
would have to be in the before value3
.value3
and value6
would have to be encoded before value2
.Edit: by encoded here I mean placed into values on the Layer.
// Dictionary encoding for keys
repeated string keys = 3;
// Dictionary encoding for values
repeated Value values = 4;
@flippmoke ah, yes, we (@pnorman and I) were discussing this. I don't think we actually want to allow unlimited nest-ness. It has very limited practical value, whereas having just one level of unlimited key-value storage has a huge practical benefit - primarily for storing all possible translations of a given object (key=name, value={language-code => string}. The moment we put this restriction in place, the whole system becomes much more manageable and easy to implement:
tags
field is being created now.If we're going to allow nesting at all, and it requires a version bump, why not allow arbitrary JSON objects?
Values are already tagged with an integer for their type, and only 1 through 7 are defined. So type 8 could denote that the value is an array (encoded as an integer array of sub-values) and type 9 denote that it is a hash (encoded as an integer array of sub-keys alternating with sub-values).
So a feature with this attribute:
"nested" : { "one" : "two", "three" : [ 4, 5, 6 ] }
would have these keys in the layer:
0. nested
1. one
2. three
and these values:
0. two (string)
1. 4 (int)
2. 5 (int)
3. 6 (int)
4. 1,2,3 (list)
5. 1,0,2,4 (hash)
with an outer key-value list of
0,5
I just realized that this was what @nyurik already said. Sorry for the needless duplication.
@ericfischer I'm not sure it's the same - I actually think that we should NOT allow arbitrary nested levels, as it would create too many encoding and decoding problems, while not giving us much practical benefits. After all, vector tiles is a very domain-specific data storage format. My proposal is either to not change the spec at all (except that now a vector tile would have any number of non-predefined keys), or, if we change the spec, we only allow two new types (list and object) at the top level. Neither list nor object will allow storing other lists or objects inside of them for the reasons outlined above.
Encoding example
{ "Features": [
{ "str": "aaa",
"dic": { "en": "aaa", "fr": "bbb"},
"arr": [ 100, 200 ]
}, {
"dic": { "en": "ccc", "gr": "bbb" },
"arr": [ 100, 300 ]
}
]}
VTile structure:
# 0 1 2 3 4 5
keys: [ "str", "en", "fr", "dic", "arr", "gr" ]
values: [
"aaa" (string) # 0
"bbb" (string) # 1
# key value pairs, value MUST NOT be object or array,
[1,0,2,1] (object) # 2 all value indexes MUST be less than the index of this element.
100 (int) # 3
200 (int) # 4
[3,4] (array) # 5 list of value indexes. Same restrictions as for Object
"ccc" (string) # 6
[1,6,5,1] (object) # 7
300 (int) # 8
[3,8] (array) # 9
]
features: [
{ tags: [0,0, 3,2, 4,5] },
{ tags: [3,7, 4,9] }
]
@flippmoke, I'm not so sure that repeating keys to make a list is better than adding a new value type. The decoder would need to do an additional key lookup for every new key, to make sure it hasn't seen it before (performance), and because the key needs to be repeated (compression). UPDATE: @flippmoke yes, the values contained in the object have to be encoded before the object itself (same for arrays). But because nesting is not allowed, this would not be a problem.
Maybe to make validation easier, there could be a rule that a value can only refer to values that are numbered lower than itself? This would prevent reference loops.
@ericfischer, while it would make the validation easier, it would also make encoding slightly slower and harder. Are there any practical reasons to allow infinite nesting? @pnorman mentioned that allowing nesting inside arrays would make multidimensional (Jagged) arrays possible, but I haven't seen a usage example yet.
I have seen one case where someone embedded an entirely different GeoJSON feature as an attribute of a GeoJSON feature. I don't know how common this need is in reality though.
From my perspective, guaranteeing that sub-values will be numbered below the value that refers to them seems easy, because you can't build your list of references until you have already assigned IDs to the sub-values, so the ID that then gets assigned to the list of references, once it is complete, will have to be greater than the IDs that were assigned earlier in the process to the things that went into it. But there might be reasons to want to write it the other way around instead. (I should try writing an implementation of this to make sure I'm not missing something important.)
@nyurik Whether or not it is a good idea, here is my experimental implementation of it in Tippecanoe: https://github.com/mapbox/tippecanoe/pull/303
@ericfischer, thanks. I will attempt to do the same with the https://github.com/mapbox/vector-tile-js .
While we are introducing a major change, question for proto-buf gurus: does it make sense to make keys
and values
to have lower ID number than features
? Would it force protobuf to store features at the end, ensuring that a parser could read content sequentially, instead of skipping features and then coming back? I just looked at the https://github.com/mapbox/pbf/blob/master/index.js#L183 - sees skipping values is a linear traversal.
@ericfischer @nyurik could you both drop sample .proto
files to make it easier to understand exactly what you are doing
@flippmoke https://github.com/mapbox/vector-tile-spec/pull/88 (keeping it in 2.1 for easy diffing). I am not sure if we need to introduce new message types.
@flippmoke @nyurik Here is my version of the .proto change. I also added a null type as 10 so that all the JSON types are covered.
I had used 8 for lists and 9 for hashes yesterday, but just switched them now to be consistent with @nyurik's version.
To generate an object or an array in vtile using Mapnik...
Mapnik faces certain limitations with generating MapBox Vector Tiles, but it's not the only generator implementation, and all generator implementations will require changes for encoding key=>val stores or arrays as property values.
Simple way that does not alter this spec ...
People are already stuffing strings with special meaning into property values, which hopefully we can avoid. Continuing to do this requires no changes to the spec, so isn't really relevant here.
Also, the space differences between 'name:lang' and '*:lang' are tiny in most cases, even more reduced by compression, and speak to a particular vector tile schema which others may not be using.
Based on discussions I've had with others, the two features needed are a key=>val store of strings and array support. In rough order of significance, I have not seen the need for
Additionally,
Two ways of storing these new types were discussed.
The first is to add a new repeated field to the layer with object values, and then have new Value fields of repeated uint32 array_val
and repeated uint32 kv_val
, where array_val
is a list of object value indexes and kv_val is a list of layer key indexes and layer object value indexes.
Pros:
Cons:
The second is to add new Value fields. These would be as above, but instead refer to Layer key indexes and Layer value indexes. An additional restriction in the text would be that array_val and kv_val fields can only refer to Values in the values list which are of the currently existing types.
Pros:
Cons:
I also added a null type as 10 so that all the JSON types are covered.
I recommend not adding null
@flippmoke Can you create a dev branch in the repo which is the same as the master, except with 2.1 copied to 3.0? This will let us make PRs against it
@pnorman yes, I will setup a 3.0 dev branch. I have a feeling that a 3.0 branch will be much harder to find a consensus though compared to the 2.0. So don't expect any quick merges
I just re-posted my proposed change as #89. In the future, we should switch to 4 space tabs - easier to view diffs
@pnorman I have no objection to dropping key-value pairs where the value is null, which Tippecanoe (and, I think, Mapnik) does now when it encounters them. What do you think the right thing is to do when someone specifies null as an element of an array, though?
I recommend not adding null
@pnorman why is that?
@ericfischer Our current JS implementation interprets a missing value (e.g. a pbf message with no field set) as null
, and I think that'd be preferable to having an explicit null
value.
Thanks @kkaefer. I've updated https://github.com/mapbox/tippecanoe/pull/303 to encode nulls as empty messages rather than as their own type.
The problem with this that I just discovered is that empty packed arrays in protocol buffers are indistinguishable from an empty message, so empty lists and hashes are mistaken for null values. I think we need magic empty-list and empty-hash messages (or a magic value index that means the same).
In https://github.com/mapbox/tippecanoe/pull/303/commits/9319f2e1b2022 I propose:
// 0 signifies an empty hash, 1 signifies an empty list
optional uint64 empty_value = 10;
as a way to represent empty lists and hashes.
It would be great to add an example of encoding an
Array
orObject
from the original GeoJSON to the PBF value (from the4.5
example) to make it clear how the v2 spec should encode these values. Are the values encoded as string values and required to be parsed by the user?i.e. for an array value:
Related to https://github.com/mapbox/vector-tiles/issues/12
cc @flippmoke @danswick