How are property values that are objects/arrays encoded in vector tiles?

mapsam commented 8 years ago

It would be great to add an example of encoding an Array or Object from the original GeoJSON to the PBF value (from the 4.5 example) to make it clear how the v2 spec should encode these values. Are the values encoded as string values and required to be parsed by the user?

i.e. for an array value:

values: {
  string_value: "[\"one\",\"two\",3]"
}

cc @flippmoke @danswick

jfirebaugh commented 8 years ago

My understanding is that array and object values are not part of the specification, and if you want to use them you need to invent your own serialization protocol and ensure that the encoders and decoders in your part of the ecosystem understand it. In other words, there is no "should" here at the level of the VT spec.

mapsam commented 8 years ago

Thanks @jfirebaugh - this is the case.

e-n-f commented 8 years ago

Mapbox Studio Classic and recent versions of Tippecanoe write them as stringified JSON, with no way to tell whether the value originally came from a string or an object.

mapsam commented 8 years ago

with no way to tell whether the value originally came from a string or an object

Interesting @ericfischer. Does that mean parsers must try to parse as JSON first, then assume it's a string if the JSON cannot be parsed?

e-n-f commented 8 years ago

@mapsam I think most parsers just treat them as strings and don't make any attempt to reconstitute objects. But @ingalls may have more details on what ultimately happens with the values that prompted https://github.com/mapbox/tippecanoe/issues/155.

mapsam commented 8 years ago

Ah, so it's up the user then to know their string/object-string values. Makes sense!

mcwhittemore commented 8 years ago

Here is one place this conversation is happening in mapbox-gl-js.

ingalls commented 8 years ago

@mapsam our use case is such that we expect either an array or a stringified array given the property name so for us parsing is relatively easy.

mcwhittemore commented 8 years ago

@mapsam mentioned reopening this to have a convo about adding how to handle objects and arrays into the spec.

@flippmoke what are your thoughts on adding this in?

pnorman commented 8 years ago

+1 to adding it in - multiple people are adding ways to represent lists (e.g. tilezen/mapbox-vector-tile#64)

nyurik commented 8 years ago

I have been talking to a large number of people about this at SOTM 2016. The biggest hurdle to store the unlimited key-values (object/dictionary/hstore) and the array support is that Mapnik is fairly rigid in terms of non-simple-type columns, and the number of those columns, so adding this support to Mapnik would require substantial efforts. I would like to propose both the Mapnik workaround as well as the spec change.

To generate an object or an array in vtile using Mapnik, one could generate the data as a single JSON-encoded string value, and add a custom server-side vector post-processing step that decodes JSON and converts it into values. This way, "name": {"en":"London","hu":"London","fr":"Londres"} would be stored as a long string, with multiple identical strings repeating, and the post-processor would convert it to 3 individual tags - "name:en":"London", "name:hu":"London", "name:fr":"Londres" (thus deduping the values).

There are two ways to actually store it:

Simple way that does not alter this spec -- simply add all keys and values as the layer message's keys and values, and update the tags fields. This might break some software's expectation that the list of tag columns is predefined, but otherwise no spec change is needed. One drawback of this approach is that the keys would be much longer - "name:en" instead of "en". Once implemented, a collate() function needs to be implemented to handle multiple languages - collate(tags,"name:fr","name:hu","name:en") would get the first existing value.
A more complex way is to extend the schema with two additional value types: dictionary and list.

The Value message should have two additional fields:

    // A list of key-value pairs. Just like Feature's "tags", the repeated pairs
    // of integers point to Layer's "keys" (1st integer) and "values" (2nd integer).
    repeated uint32 object_value = 8 [ packed = true ];

    // An array of values. Similar to Feature's "tags", each integer represents
    // an index of the Layer's "values" array.
    repeated uint32 list_value = 9 [ packed = true ];

So if the value contains object_value, the odd integers are the indexes of the Layer's keys array, and the even indexes refer to the Layer's values array. Arrays are implemented the same way, only this time list_value indexes only refer to the Layer's values array.

One of the drawback is that it is now possible to have circular dependencies, so to avoid this and simplify client's data validation, data may only be two levels deep (root+one sublevel). Another unenforced but possibly valuable requirement might be to restrict all values in both object and list to have the same type.

flippmoke commented 8 years ago

Sorry for my lack of responses on this ticket: Let me summarize some thoughts I have on this currently.

Mapnik Free Vector Tiles

I would like to propose both the Mapnik workaround as well as the spec change.

Eventually we do want to pull out all Mapbox vector tile encoding and decoding such that it is not required to be a part of Mapnik. In general mapnik-vector-tile issues should probably be discussed there. Here on the spec trying to specifically just cover ideas specific to the specification.

Lists

List support could be added by changing section 4.4 - which currently reads as follows:

Feature attributes are encoded as pairs of integers in the tag field of a feature. The first integer in each pair represents the zero-based index of the key in the keys set of the layer to which the feature belongs. The second integer in each pair represents the zero-based index of the value in the values set of the layer to which the feature belongs. Every key index MUST be unique within that feature such that no other attribute pair within that feature has the same key index. A feature MUST have an even number of tag fields. A feature tag field MUST NOT contain a key index or value index greater than or equal to the number of elements in the layer's keys or values set, respectively.

In theory we could allow keys not to be unique per a feature and if more then one occurs we assume the item to be a list. This is not ideal for compression necessarily but could solve the problem.

I do not think this change could qualify as anything but a major version incrementation (though we could debate on this).

Objects

Objects scare me quite a bit more because we suddenly are adding depth "values". This could be something that was very bad for performance overall if not implemented with thought and with some testing.

Example Implementation:

       message Object {
               // Same as feature tags currently
               repeated uint32 tags = 2 [ packed = true ];
       }

       message Value {
                // Exactly one of these values must be present in a valid message
                optional string string_value = 1;
                optional float float_value = 2;
                optional double double_value = 3;
                optional int64 int_value = 4;
                optional uint64 uint_value = 5;
                optional sint64 sint_value = 6;
                optional bool bool_value = 7;
                optional Object object_value = 8; 
                extensions 9 to max;
        }

Final Thoughts

I am going to leave this ticket open for continued discussion. Thanks for the feedback everyone!

flippmoke commented 8 years ago

@nyurik part of the challenge here with both of our ideas is that that means that all non object values MUST be encoded first. This would put a variety of different requirements on both the encoders and decoders.

nyurik commented 8 years ago

@flippmoke could you elaborate on your last comment? I'm not sure I understood.

flippmoke commented 8 years ago

@nyurik

Lets just assume the following structure:

{
    "key1": value1(int)
    "key2": value2(object) {
           "key3": value3(object) {
                 "key4": value4(int),
                 "key5": value5(int)
           },
           "key6": value6(int)
     },
     "key7": value7(int)
}

Because Value now has an "Object" that must know the index of keys and values. The order of encoding becomes critical.

value4 and value5 would have to be in the before value3.
value3 and value6 would have to be encoded before value2.

Edit: by encoded here I mean placed into values on the Layer.

                // Dictionary encoding for keys
                repeated string keys = 3;

                // Dictionary encoding for values
                repeated Value values = 4;

nyurik commented 8 years ago

@flippmoke ah, yes, we (@pnorman and I) were discussing this. I don't think we actually want to allow unlimited nest-ness. It has very limited practical value, whereas having just one level of unlimited key-value storage has a huge practical benefit - primarily for storing all possible translations of a given object (key=name, value={language-code => string}. The moment we put this restriction in place, the whole system becomes much more manageable and easy to implement:

No need to validate invalid recursive tags (value3 being the same as value2 in your example)
Encoding all the object's keys & values is a linear process - you add them to the Layer's keys & values, record their location, and save it to tags. This is identical to how the tags field is being created now.

e-n-f commented 8 years ago

If we're going to allow nesting at all, and it requires a version bump, why not allow arbitrary JSON objects?

Values are already tagged with an integer for their type, and only 1 through 7 are defined. So type 8 could denote that the value is an array (encoded as an integer array of sub-values) and type 9 denote that it is a hash (encoded as an integer array of sub-keys alternating with sub-values).

So a feature with this attribute:

"nested" : { "one" : "two", "three" : [ 4, 5, 6 ] }

would have these keys in the layer:

0. nested
1. one
2. three

and these values:

0. two (string)
1. 4 (int)
2. 5 (int)
3. 6 (int)
4. 1,2,3 (list)
5. 1,0,2,4 (hash)

with an outer key-value list of

0,5

e-n-f commented 8 years ago

I just realized that this was what @nyurik already said. Sorry for the needless duplication.

nyurik commented 8 years ago

@ericfischer I'm not sure it's the same - I actually think that we should NOT allow arbitrary nested levels, as it would create too many encoding and decoding problems, while not giving us much practical benefits. After all, vector tiles is a very domain-specific data storage format. My proposal is either to not change the spec at all (except that now a vector tile would have any number of non-predefined keys), or, if we change the spec, we only allow two new types (list and object) at the top level. Neither list nor object will allow storing other lists or objects inside of them for the reasons outlined above.

nyurik commented 8 years ago

Encoding example

{ "Features": [
  { "str": "aaa",
    "dic": { "en": "aaa", "fr": "bbb"},
    "arr": [ 100, 200 ]
  }, {
    "dic": { "en": "ccc", "gr": "bbb" },
    "arr": [ 100, 300 ]
  }
]}

VTile structure:

      # 0      1     2     3      4      5
keys: [ "str", "en", "fr", "dic", "arr", "gr" ]
values: [
  "aaa"     (string) # 0
  "bbb"     (string) # 1
                     #    key value pairs, value MUST NOT be object or array,
  [1,0,2,1] (object) # 2  all value indexes MUST be less than the index of this element.
  100       (int)    # 3
  200       (int)    # 4
  [3,4]     (array)  # 5  list of value indexes. Same restrictions as for Object
  "ccc"     (string) # 6
  [1,6,5,1] (object) # 7
  300       (int)    # 8
  [3,8]     (array)  # 9
]
features: [
  { tags: [0,0, 3,2, 4,5] },
  { tags: [3,7, 4,9] }
]

@flippmoke, I'm not so sure that repeating keys to make a list is better than adding a new value type. The decoder would need to do an additional key lookup for every new key, to make sure it hasn't seen it before (performance), and because the key needs to be repeated (compression). UPDATE: @flippmoke yes, the values contained in the object have to be encoded before the object itself (same for arrays). But because nesting is not allowed, this would not be a problem.

e-n-f commented 8 years ago

Maybe to make validation easier, there could be a rule that a value can only refer to values that are numbered lower than itself? This would prevent reference loops.

nyurik commented 8 years ago

@ericfischer, while it would make the validation easier, it would also make encoding slightly slower and harder. Are there any practical reasons to allow infinite nesting? @pnorman mentioned that allowing nesting inside arrays would make multidimensional (Jagged) arrays possible, but I haven't seen a usage example yet.

e-n-f commented 8 years ago

I have seen one case where someone embedded an entirely different GeoJSON feature as an attribute of a GeoJSON feature. I don't know how common this need is in reality though.

From my perspective, guaranteeing that sub-values will be numbered below the value that refers to them seems easy, because you can't build your list of references until you have already assigned IDs to the sub-values, so the ID that then gets assigned to the list of references, once it is complete, will have to be greater than the IDs that were assigned earlier in the process to the things that went into it. But there might be reasons to want to write it the other way around instead. (I should try writing an implementation of this to make sure I'm not missing something important.)

e-n-f commented 8 years ago

@nyurik Whether or not it is a good idea, here is my experimental implementation of it in Tippecanoe: https://github.com/mapbox/tippecanoe/pull/303

nyurik commented 8 years ago

@ericfischer, thanks. I will attempt to do the same with the https://github.com/mapbox/vector-tile-js .

While we are introducing a major change, question for proto-buf gurus: does it make sense to make keys and values to have lower ID number than features? Would it force protobuf to store features at the end, ensuring that a parser could read content sequentially, instead of skipping features and then coming back? I just looked at the https://github.com/mapbox/pbf/blob/master/index.js#L183 - sees skipping values is a linear traversal.

flippmoke commented 8 years ago

@ericfischer @nyurik could you both drop sample .proto files to make it easier to understand exactly what you are doing

nyurik commented 8 years ago

@flippmoke https://github.com/mapbox/vector-tile-spec/pull/88 (keeping it in 2.1 for easy diffing). I am not sure if we need to introduce new message types.

e-n-f commented 8 years ago

@flippmoke @nyurik Here is my version of the .proto change. I also added a null type as 10 so that all the JSON types are covered.

e-n-f commented 8 years ago

I had used 8 for lists and 9 for hashes yesterday, but just switched them now to be consistent with @nyurik's version.

pnorman commented 8 years ago

To generate an object or an array in vtile using Mapnik...

Mapnik faces certain limitations with generating MapBox Vector Tiles, but it's not the only generator implementation, and all generator implementations will require changes for encoding key=>val stores or arrays as property values.

Simple way that does not alter this spec ...

People are already stuffing strings with special meaning into property values, which hopefully we can avoid. Continuing to do this requires no changes to the spec, so isn't really relevant here.

Also, the space differences between 'name:lang' and '*:lang' are tiny in most cases, even more reduced by compression, and speak to a particular vector tile schema which others may not be using.

Based on discussions I've had with others, the two features needed are a key=>val store of strings and array support. In rough order of significance, I have not seen the need for

Recursive data structures
Nested data structures
Non-string key=>val store

Additionally,

Deduplicating kv keys and values is important, but deduplicating the entire kv store is of less value.
There will be duplication between kv values and object values, but not so much with keys

Two ways of storing these new types were discussed.

The first is to add a new repeated field to the layer with object values, and then have new Value fields of repeated uint32 array_val and repeated uint32 kv_val, where array_val is a list of object value indexes and kv_val is a list of layer key indexes and layer object value indexes.

Pros:

Can prevent recursion and nesting on a protobuf level, rather than relying on language in the spec
Deduplicates kv value keys and layer keys with each other
Deduplicates kv/array values

Cons:

Doesn't deduplicate kv value values with layer values
Requires another repeated element on the Layer, which strikes me as less elegant

The second is to add new Value fields. These would be as above, but instead refer to Layer key indexes and Layer value indexes. An additional restriction in the text would be that array_val and kv_val fields can only refer to Values in the values list which are of the currently existing types.

Pros:

Deduplicates kv value values and layer values with each other
Feels more elegant

Cons:

Puts more restriction into the spec text

pnorman commented 8 years ago

I also added a null type as 10 so that all the JSON types are covered.

I recommend not adding null

@flippmoke Can you create a dev branch in the repo which is the same as the master, except with 2.1 copied to 3.0? This will let us make PRs against it

flippmoke commented 8 years ago

@pnorman yes, I will setup a 3.0 dev branch. I have a feeling that a 3.0 branch will be much harder to find a consensus though compared to the 2.0. So don't expect any quick merges

flippmoke commented 8 years ago

@pnorman https://github.com/mapbox/vector-tile-spec/tree/v3.0-development

nyurik commented 8 years ago

I just re-posted my proposed change as #89. In the future, we should switch to 4 space tabs - easier to view diffs

e-n-f commented 8 years ago

@pnorman I have no objection to dropping key-value pairs where the value is null, which Tippecanoe (and, I think, Mapnik) does now when it encounters them. What do you think the right thing is to do when someone specifies null as an element of an array, though?

kkaefer commented 7 years ago

I recommend not adding null

@pnorman why is that?

@ericfischer Our current JS implementation interprets a missing value (e.g. a pbf message with no field set) as null, and I think that'd be preferable to having an explicit null value.

e-n-f commented 7 years ago

Thanks @kkaefer. I've updated https://github.com/mapbox/tippecanoe/pull/303 to encode nulls as empty messages rather than as their own type.

e-n-f commented 6 years ago

The problem with this that I just discovered is that empty packed arrays in protocol buffers are indistinguishable from an empty message, so empty lists and hashes are mistaken for null values. I think we need magic empty-list and empty-hash messages (or a magic value index that means the same).

e-n-f commented 6 years ago

In https://github.com/mapbox/tippecanoe/pull/303/commits/9319f2e1b2022 I propose:

                // 0 signifies an empty hash, 1 signifies an empty list
                optional uint64 empty_value = 10;

as a way to represent empty lists and hashes.

mapbox / vector-tile-spec