mapbox / vector-tile-spec

Mapbox Vector Tile specification
https://www.mapbox.com/vector-tiles/specification/
Other
899 stars 212 forks source link

Complex Properties Overhaul #121

Closed flippmoke closed 6 years ago

flippmoke commented 6 years ago

This is different then the more simple approach to adding lists and maps in #117. This is a complete overhaul of the way properties are encoded. The thought process behind this change is to allow for higher levels of compression of values by:

This also allows for null values.

Solves #75 and #62

flippmoke commented 6 years ago

Another slight changed proposed by @kkaefer I should mention (but not reflected currently)

using the special index "0" to mean "specified inline" for any of the index types

This could allow all types to be inlined.

e-n-f commented 6 years ago

An experimental implementation of this scheme is in https://github.com/mapbox/tippecanoe/pull/611

In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show meaningful improvement though.

flippmoke commented 6 years ago

I have experimented a little more with your branch and put in a few different sets of data:

North American Road Data

For one set of data, I used some north american road data, here are the results of file sizes:

-rw-r--r-- 1 mthompson 410M Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 430M Jul 23 11:42 na_roads_master.mbtiles

or

-rw-r--r-- 1 mthompson 428904448 Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 450768896 Jul 23 11:42 na_roads_master.mbtiles

This is about ~5% reduction in size.

Sample data format properties:

"properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" }

My guess would be that we are seeing a reduction due to inline integers mostly in this case.

OSM Based Point Data

File sizes:

-rw-r--r-- 1 mthompson 832K Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 888K Jul 23 11:43 kathmandu_master.mbtiles

or

-rw-r--r-- 1 mthompson 851968 Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 909312 Jul 23 11:43 kathmandu_master.mbtiles

This is about a 7% reduction.

Example properties:

"properties": { "osm_id": 3483919843.0, "access": null, "aerialway": null, "aeroway": null, "amenity": null, "area": null, "barrier": null, "bicycle": null, "brand": null, "bridge": null, "boundary": null, "building": null, "capital": null, "covered": null, "culvert": null, "cutting": null, "disused": null, "ele": null, "embankment": null, "foot": null, "harbour": null, "highway": null, "historic": null, "horse": null, "junction": null, "landuse": null, "layer": null, "leisure": null, "lock": null, "man_made": null, "military": null, "motorcar": null, "name": null, "natural": null, "oneway": null, "operator": null, "poi": null, "population": null, "power": null, "place": null, "railway": null, "ref": null, "religion": null, "route": null, "service": null, "shop": null, "sport": null, "surface": null, "toll": null, "tourism": null, "tower:type": null, "tunnel": null, "water": null, "waterway": null, "wetland": null, "width": null, "wood": null, "z_order": null, "tags": "\"ford\"=>\"yes\"" }
e-n-f commented 6 years ago

Interesting. Thanks for the additional research. Can you add links to the files you are testing with?

e-n-f commented 6 years ago

New experiment with sorting the values but retaining the v2 encoding:

➤ ./tippecanoe -Voriginal -zg -f -o kathmandu-original.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vreordered -zg -f -o kathmandu-reordered.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ./tippecanoe -Vblake -zg -f -o kathmandu-blake.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
  99.9%  11/1508/860
➤ ls -l kathmandu-*mbtiles
-rw-r--r-- 1 enf staff 577536 Jul 23 14:58 kathmandu-blake.mbtiles
-rw-r--r-- 1 enf staff 655360 Jul 23 14:58 kathmandu-original.mbtiles
-rw-r--r-- 1 enf staff 602112 Jul 23 14:58 kathmandu-reordered.mbtiles
➤ ./tippecanoe -Voriginal --no-tile-size-limit -zg -f -o neroads-original.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vreordered --no-tile-size-limit -zg -f -o neroads-reordered.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/41/96
➤ ./tippecanoe -Vblake --no-tile-size-limit -zg -f -o neroads-blake.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
  99.9%  8/57/96
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18317312 Jul 23 15:04 neroads-reordered.mbtiles

At least now I know it's worth spending a little extra time to sort the values before writing out the tile, even if there is some additional advantage to either inlining values or using repeated messages.

e-n-f commented 6 years ago

The Natural Earth roads are improved slightly by also sorting the keys:

➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18292736 Jul 23 15:17 neroads-reordered.mbtiles
e-n-f commented 6 years ago

Blake's format, but without inline ints:

So I think inlining is helping more than repeated messages are.

e-n-f commented 6 years ago

Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):

-rw-r--r-- 1 enf staff 17403904 Jul 24 10:21 neroads-blake-float.mbtiles
-rw-r--r-- 1 enf staff 17440768 Jul 24 10:20 neroads-blake-regular.mbtiles

This also highlights that we need more than 3 bits for types. In fact this PR already actually requires 4, because it specifies "list / map" as type 8, which won't fit in 3 bits. I'll add a 4th type bit to the test implementation and recalculate.

e-n-f commented 6 years ago

Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.

-rw-r--r-- 1 enf staff 17502208 Jul 24 10:31 neroads-blake-regular.mbtiles
mourner commented 6 years ago

Also question — how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint vlues? And if we remove "index to int/sint" types, this makes the types fit into 3 bytes again (including list/map without an additional bit).

flippmoke commented 6 years ago

@mourner We can not currently remove the indexed integer types because there is a limit to the size that the inlined values can represent currently. Here is how @ericfischer currently did the implementation for when to use inline vs indexed. I think this makes sense overall, the only time there might be a bit savings by using index over inline would be values that are larger then could be represented by 24 bit integers that are highly repeated (probably more then 3 or 4 times) and would have a low index value. This could be calculated on the fly if required, but I don't know that we need to have such a complex implementation. I think overall the inline seems to save space on average so we may not need such complex code.

joto commented 6 years ago

Here are some random thoughts:

e-n-f commented 6 years ago
e-n-f commented 6 years ago

The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there.

joto commented 6 years ago

Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so.

Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned it will lead to smaller index numbers which, especially for the keys case is probably worth it. Here is another idea though: Currently all keys/string_values are directly in the layer object, if we push this down one level and have an intermediate string_table object, it could be more efficient. It would allow us to jump over the whole table or copy the whole table in one go. The cost is one more byte for the type and one varint for the length of the whole table. Double that if we have separate keys/string_values tables.

Regarding integer value encodings: If we put large integers that can't be inlined as separate varint in the properties array, we will hit a bad case for varints. Because they are always large, chances are they will get even larger as varints (max 10 bytes compared to 8 bytes for the int itself). So there is some inefficiency there. On the other hand, if we want to put them in an index table, we can use a fixed size type instead of a varint, which would avoid this and also make access more efficient, because we can directly address values in those tables without having to decode them first. So I think if we keep the tables, they should be of type (s)fixed32/64 instead of (s/u)int32/64. It would still be an indirect access which likely is slower than inlining though.

We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.

This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselfs for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib.

e-n-f commented 6 years ago

On a different topic:

flippmoke commented 6 years ago

Closing in favor of #123