Closed flippmoke closed 6 years ago
Another slight change proposed by @kkaefer that I should mention (but not currently reflected here):
using the special index "0" to mean "specified inline" for any of the index types
This could allow all types to be inlined.
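A minimal sketch of how a decoder might treat that reserved index (the names here are hypothetical, not from the spec):

```python
# Hypothetical decoder for the "index 0 means specified inline" convention.
# Any nonzero index points into the layer's value table; 0 is reserved to
# mean the value is carried inline in the feature itself.

INLINE = 0

def decode_value(index, inline_value, value_table):
    """Return the attribute value for one property entry."""
    if index == INLINE:
        return inline_value           # value was encoded inline
    return value_table[index - 1]     # 1-based because 0 is reserved

# Example: a small value table with one inline integer mixed in.
table = ["highway", "residential"]
assert decode_value(1, None, table) == "highway"
assert decode_value(0, 42, table) == 42
```

The cost of the convention is that every table index shifts up by one; the benefit is that any type can be inlined without a separate type code.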
An experimental implementation of this scheme is in https://github.com/mapbox/tippecanoe/pull/611
In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show a more meaningful improvement, though.
I have experimented a little more with your branch and put in a few different sets of data:
For one set of data, I used some North American road data; here are the resulting file sizes:
-rw-r--r-- 1 mthompson 410M Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 430M Jul 23 11:42 na_roads_master.mbtiles
or
-rw-r--r-- 1 mthompson 428904448 Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 450768896 Jul 23 11:42 na_roads_master.mbtiles
This is about a 5% reduction in size.
Sample data format properties:
"properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" }
My guess would be that we are seeing a reduction due to inline integers mostly in this case.
File sizes:
-rw-r--r-- 1 mthompson 832K Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 888K Jul 23 11:43 kathmandu_master.mbtiles
or
-rw-r--r-- 1 mthompson 851968 Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 909312 Jul 23 11:43 kathmandu_master.mbtiles
This is about a 7% reduction.
Example properties:
"properties": { "osm_id": 3483919843.0, "access": null, "aerialway": null, "aeroway": null, "amenity": null, "area": null, "barrier": null, "bicycle": null, "brand": null, "bridge": null, "boundary": null, "building": null, "capital": null, "covered": null, "culvert": null, "cutting": null, "disused": null, "ele": null, "embankment": null, "foot": null, "harbour": null, "highway": null, "historic": null, "horse": null, "junction": null, "landuse": null, "layer": null, "leisure": null, "lock": null, "man_made": null, "military": null, "motorcar": null, "name": null, "natural": null, "oneway": null, "operator": null, "poi": null, "population": null, "power": null, "place": null, "railway": null, "ref": null, "religion": null, "route": null, "service": null, "shop": null, "sport": null, "surface": null, "toll": null, "tourism": null, "tower:type": null, "tunnel": null, "water": null, "waterway": null, "wetland": null, "width": null, "wood": null, "z_order": null, "tags": "\"ford\"=>\"yes\"" }
Interesting. Thanks for the additional research. Can you add links to the files you are testing with?
New experiment with sorting the values but retaining the v2 encoding:
➤ ./tippecanoe -Voriginal -zg -f -o kathmandu-original.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ./tippecanoe -Vreordered -zg -f -o kathmandu-reordered.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ./tippecanoe -Vblake -zg -f -o kathmandu-blake.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ls -l kathmandu-*mbtiles
-rw-r--r-- 1 enf staff 577536 Jul 23 14:58 kathmandu-blake.mbtiles
-rw-r--r-- 1 enf staff 655360 Jul 23 14:58 kathmandu-original.mbtiles
-rw-r--r-- 1 enf staff 602112 Jul 23 14:58 kathmandu-reordered.mbtiles
➤ ./tippecanoe -Voriginal --no-tile-size-limit -zg -f -o neroads-original.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/41/96
➤ ./tippecanoe -Vreordered --no-tile-size-limit -zg -f -o neroads-reordered.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/41/96
➤ ./tippecanoe -Vblake --no-tile-size-limit -zg -f -o neroads-blake.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/57/96
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18317312 Jul 23 15:04 neroads-reordered.mbtiles
At least now I know it's worth spending a little extra time to sort the values
before writing out the tile, even if there is some additional advantage to either inlining values or using repeated messages.
The Natural Earth roads are improved slightly by also sorting the keys:
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18292736 Jul 23 15:17 neroads-reordered.mbtiles
Blake's format, but without inline ints:
So I think inlining is helping more than repeated messages are.
Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):
-rw-r--r-- 1 enf staff 17403904 Jul 24 10:21 neroads-blake-float.mbtiles
-rw-r--r-- 1 enf staff 17440768 Jul 24 10:20 neroads-blake-regular.mbtiles
This also highlights that we need more than 3 bits for types. In fact, this PR already requires 4, because it specifies "list / map" as type 8, which won't fit in 3 bits. I'll add a 4th type bit to the test implementation and recalculate.
Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.
-rw-r--r-- 1 enf staff 17502208 Jul 24 10:31 neroads-blake-regular.mbtiles
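The size cost of that 4th bit is easy to see in a tagged-varint layout, where the type occupies the low bits of the tag and the payload is shifted above it: each extra type bit pushes some payloads across a varint byte boundary. A rough sketch (this exact layout is illustrative, not necessarily the PR's):

```python
def varint_len(n):
    """Bytes needed to encode n as an unsigned varint (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def tagged_len(payload, type_bits):
    """Varint length of a tag with `type_bits` low bits of type, payload above."""
    return varint_len(payload << type_bits)

# A payload of 15 fits in one varint byte with 3 type bits (15 << 3 = 120 < 128)
# but needs two bytes with 4 type bits (15 << 4 = 240 >= 128).
assert tagged_len(15, 3) == 1
assert tagged_len(15, 4) == 2
```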
Also, a question — how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint values? And if we remove the "index to int/sint" types, the types fit into 3 bits again (including list/map, without an additional bit).
@mourner We cannot currently remove the indexed integer types because there is a limit to the size that inlined values can currently represent. Here is how @ericfischer currently implemented the decision of when to use inline vs. indexed. I think this makes sense overall; the only time there might be a bit savings from using an index over inlining is for values that are too large to be represented as 24-bit integers, are highly repeated (probably more than 3 or 4 times), and would have a low index value. This could be calculated on the fly if required, but I don't know that we need such a complex implementation. Overall, inlining seems to save space on average, so we may not need such complex code.
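A simplified sketch of such an encoder decision, assuming a 24-bit inline cutoff as mentioned above (the actual rule in the prototype may differ):

```python
# Hypothetical encoder decision: inline small values, index large ones.
# The 24-bit cutoff is taken from the discussion above; the real
# implementation may use a different threshold or extra heuristics
# (e.g. repetition counts) before falling back to the index table.

INLINE_BITS = 24

def zigzag(v):
    """Map signed ints to unsigned so small negatives stay small."""
    return (v << 1) ^ (v >> 63)

def should_inline(v):
    """Inline when the zigzagged value fits the inline payload; else index it."""
    return zigzag(v) < (1 << INLINE_BITS)

assert should_inline(1000) and should_inline(-1000)  # small either way
assert not should_inline(1 << 40)                    # too big: goes to the table
```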
Here are some random thoughts:
Merging the keys into the string_values field. This simplifies things slightly (and saves a few bytes for the second table header) but probably makes the index values for often-used keys larger (unless you take care to put all keys into the string table first). It also makes reuse of the keys table impossible in shaving or similar use cases. Creating this combined table is also more expensive in the first place, because a separate keys table has a small key space, which makes it easier to find the index of a given key. On the other hand, with nested maps a single table is a bit easier, because there is no question where all the strings are.

The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there.
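That mantissa distribution can be checked directly with `math.frexp`, which splits a double into a mantissa in [0.5, 1) and a power-of-two exponent:

```python
import math

# math.frexp(x) returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1,
# i.e. exactly the mantissa interval discussed above.
for x in (3.14159, 4.559190, 0.001, 12345.678):
    m, e = math.frexp(x)
    assert 0.5 <= abs(m) < 1.0   # mantissa is normalized into [0.5, 1)
    assert m * 2 ** e == x       # the split is exact; it just regroups the bits

# "Everyday" magnitudes use only a narrow band of exponents, which is the
# one place a shorter encoding might plausibly help.
assert math.frexp(8.0) == (0.5, 4)
```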
Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so.
Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned, it will lead to smaller index numbers, which, especially in the keys case, is probably worth it. Here is another idea, though: Currently all keys/string_values are directly in the layer object. If we push this down one level and have an intermediate string_table object, it could be more efficient. It would allow us to jump over the whole table or copy the whole table in one go. The cost is one more byte for the type and one varint for the length of the whole table; double that if we have separate keys/string_values tables.
Regarding integer value encodings: If we put large integers that can't be inlined as separate varints in the properties array, we will hit a bad case for varints. Because they are always large, chances are they will get even larger as varints (up to 10 bytes compared to 8 bytes for the int itself). So there is some inefficiency there. On the other hand, if we want to put them in an index table, we can use a fixed-size type instead of a varint, which avoids this and also makes access more efficient, because we can directly address values in those tables without having to decode them first. So I think if we keep the tables, they should be of type (s)fixed32/64 instead of (s/u)int32/64. It would still be an indirect access, which is likely slower than inlining, though.
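The worst case is straightforward to verify: a varint carries 7 payload bits per byte, so a full 64-bit value needs ceil(64/7) = 10 bytes, versus a flat 8 for fixed64. A small sketch:

```python
def uvarint(n):
    """Encode an unsigned int as a protobuf-style varint (little-endian base 128)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

big = 2**64 - 1
assert len(uvarint(big)) == 10                   # varint worst case
assert len(big.to_bytes(8, "little")) == 8       # fixed64 is always 8 bytes
assert uvarint(300) == b"\xac\x02"               # small values still win as varints
```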
We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.
This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselves for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib.
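For reference, zigzag maps signed values of small magnitude (of either sign) to small unsigned ones, which is what keeps their varints short; a 64-bit version:

```python
def zigzag64(v):
    """Signed -> unsigned: 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ..."""
    return ((v << 1) ^ (v >> 63)) & 0xFFFFFFFFFFFFFFFF

def unzigzag64(u):
    """Inverse mapping, back to a signed int."""
    return (u >> 1) ^ -(u & 1)

assert [zigzag64(v) for v in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
for v in (0, 1, -1, 127, -128, 2**62, -2**62):
    assert unzigzag64(zigzag64(v)) == v   # round-trips exactly
```

Without zigzag, a plain two's-complement -1 encodes as the maximum 10-byte varint, which is the "extra space" problem mentioned above.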
Glad to hear that inlining floats sounds like it is off the table.
I'm fine with putting keys and strings in separate tables, and within a container object if that is considered useful. I'll try changing my prototype to do that. Should we put all the attributes inside that message, or is there a case where readers just want to skip/copy the strings?
Good point that the integers-by-reference should be fixed-size instead of varint, since they will always be large if they didn't get inlined. I'll change my prototype and this .proto file to do that.
If we talk about inlining signed integers at all, we are inherently doing bit-packing, so we have to explicitly talk about either zigzagging or sign extension, and zigzagging is probably the better choice of the two. All clients, no matter what language they are written in, will have to be able to unpack inlined integers.
On a different topic:
Closing in favor of #123
This is different from the simpler approach to adding lists and maps in #117. This is a complete overhaul of the way properties are encoded. The thought process behind this change is to allow for higher levels of compression of values by:
This also allows for null values.
Solves #75 and #62