Closed flippmoke closed 6 years ago
Another slight change proposed by @kkaefer that I should mention (but not currently reflected here):
using the special index "0" to mean "specified inline" for any of the index types
This could allow all types to be inlined.
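A minimal sketch of how a decoder might treat that reserved index (the names here are hypothetical, not from the spec):

```python
# Hypothetical decoder for the "index 0 means specified inline" convention.
# Any nonzero index points into the layer's value table; 0 is reserved to
# mean the value is carried inline in the feature itself.

INLINE = 0

def decode_value(index, inline_value, value_table):
    """Return the attribute value for one property entry."""
    if index == INLINE:
        return inline_value           # value was encoded inline
    return value_table[index - 1]     # 1-based because 0 is reserved

# Example: a small value table with one inline integer mixed in.
table = ["highway", "residential"]
assert decode_value(1, None, table) == "highway"
assert decode_value(0, 42, table) == 42
```

The cost of the convention is that every table index shifts up by one; the benefit is that any type can be inlined without a separate type code.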
An experimental implementation of this scheme is in https://github.com/mapbox/tippecanoe/pull/611
In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show a more meaningful improvement, though.
I have experimented a little more with your branch and put in a few different sets of data:
For one set of data, I used some North American road data; here are the resulting file sizes:
-rw-r--r-- 1 mthompson 410M Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 430M Jul 23 11:42 na_roads_master.mbtiles
or
-rw-r--r-- 1 mthompson 428904448 Jul 23 12:23 na_roads_blake.mbtiles
-rw-r--r-- 1 mthompson 450768896 Jul 23 11:42 na_roads_master.mbtiles
This is about a 5% reduction in size.
Sample data format properties:
"properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" }
My guess would be that we are seeing a reduction due to inline integers mostly in this case.
File sizes:
-rw-r--r-- 1 mthompson 832K Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 888K Jul 23 11:43 kathmandu_master.mbtiles
or
-rw-r--r-- 1 mthompson 851968 Jul 23 12:17 kathmandu_blake.mbtiles
-rw-r--r-- 1 mthompson 909312 Jul 23 11:43 kathmandu_master.mbtiles
This is about a 7% reduction.
Example properties:
"properties": { "osm_id": 3483919843.0, "access": null, "aerialway": null, "aeroway": null, "amenity": null, "area": null, "barrier": null, "bicycle": null, "brand": null, "bridge": null, "boundary": null, "building": null, "capital": null, "covered": null, "culvert": null, "cutting": null, "disused": null, "ele": null, "embankment": null, "foot": null, "harbour": null, "highway": null, "historic": null, "horse": null, "junction": null, "landuse": null, "layer": null, "leisure": null, "lock": null, "man_made": null, "military": null, "motorcar": null, "name": null, "natural": null, "oneway": null, "operator": null, "poi": null, "population": null, "power": null, "place": null, "railway": null, "ref": null, "religion": null, "route": null, "service": null, "shop": null, "sport": null, "surface": null, "toll": null, "tourism": null, "tower:type": null, "tunnel": null, "water": null, "waterway": null, "wetland": null, "width": null, "wood": null, "z_order": null, "tags": "\"ford\"=>\"yes\"" }
Interesting. Thanks for the additional research. Can you add links to the files you are testing with?
New experiment with sorting the values but retaining the v2 encoding:
➤ ./tippecanoe -Voriginal -zg -f -o kathmandu-original.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ./tippecanoe -Vreordered -zg -f -o kathmandu-reordered.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ./tippecanoe -Vblake -zg -f -o kathmandu-blake.mbtiles ../kathmandu_nepal_osm_point.geojson
For layer 0, using name "kathmandu_nepal_osm_point"
12681 features, 2459902 bytes of geometry, 4 bytes of separate metadata, 389570 bytes of string pool
Choosing a maxzoom of -z11 for features about 228 feet (70 meters) apart
99.9% 11/1508/860
➤ ls -l kathmandu-*mbtiles
-rw-r--r-- 1 enf staff 577536 Jul 23 14:58 kathmandu-blake.mbtiles
-rw-r--r-- 1 enf staff 655360 Jul 23 14:58 kathmandu-original.mbtiles
-rw-r--r-- 1 enf staff 602112 Jul 23 14:58 kathmandu-reordered.mbtiles
➤ ./tippecanoe -Voriginal --no-tile-size-limit -zg -f -o neroads-original.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/41/96
➤ ./tippecanoe -Vreordered --no-tile-size-limit -zg -f -o neroads-reordered.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/41/96
➤ ./tippecanoe -Vblake --no-tile-size-limit -zg -f -o neroads-blake.mbtiles ../north-america-roads_natural-earth.geojson
For layer 0, using name "northamericaroads_naturalearth"
49183 features, 24440999 bytes of geometry, 2317328 bytes of separate metadata, 775873 bytes of string pool
Choosing a maxzoom of -z4 for features about 24225 feet (7384 meters) apart
Choosing a maxzoom of -z8 for resolution of about 1118 feet (340 meters) within features
99.9% 8/57/96
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18317312 Jul 23 15:04 neroads-reordered.mbtiles
At least now I know it's worth spending a little extra time to sort the values
before writing out the tile, even if there is some additional advantage to either inlining values or using repeated messages.
The Natural Earth roads are improved slightly by also sorting the keys:
➤ ls -l neroads-*mbtiles
-rw-r--r-- 1 enf staff 17453056 Jul 23 15:05 neroads-blake.mbtiles
-rw-r--r-- 1 enf staff 18923520 Jul 23 15:03 neroads-original.mbtiles
-rw-r--r-- 1 enf staff 18292736 Jul 23 15:17 neroads-reordered.mbtiles
Blake's format, but without inline ints:
So I think inlining is helping more than repeated messages are.
Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):
-rw-r--r-- 1 enf staff 17403904 Jul 24 10:21 neroads-blake-float.mbtiles
-rw-r--r-- 1 enf staff 17440768 Jul 24 10:20 neroads-blake-regular.mbtiles
This also highlights that we need more than 3 bits for types. In fact, this PR already requires 4, because it specifies "list / map" as type 8, which won't fit in 3 bits. I'll add a 4th type bit to the test implementation and recalculate.
Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.
-rw-r--r-- 1 enf staff 17502208 Jul 24 10:31 neroads-blake-regular.mbtiles
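The size cost of that 4th bit is easy to see in a tagged-varint layout, where the type occupies the low bits of the tag and the payload is shifted above it: each extra type bit pushes some payloads across a varint byte boundary. A rough sketch (this exact layout is illustrative, not necessarily the PR's):

```python
def varint_len(n):
    """Bytes needed to encode n as an unsigned varint (7 payload bits per byte)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def tagged_len(payload, type_bits):
    """Varint length of a tag with `type_bits` low bits of type, payload above."""
    return varint_len(payload << type_bits)

# A payload of 15 fits in one varint byte with 3 type bits (15 << 3 = 120 < 128)
# but needs two bytes with 4 type bits (15 << 4 = 240 >= 128).
assert tagged_len(15, 3) == 1
assert tagged_len(15, 4) == 2
```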
Also, a question — how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint values? And if we remove the "index to int/sint" types, the types fit into 3 bits again (including list/map, without an additional bit).
@mourner We cannot currently remove the indexed integer types because there is a limit to the size that inlined values can currently represent. Here is how @ericfischer currently implemented the decision of when to use inline vs. indexed. I think this makes sense overall; the only time there might be a bit savings from using an index over inlining is for values that are too large to be represented as 24-bit integers, are highly repeated (probably more than 3 or 4 times), and would have a low index value. This could be calculated on the fly if required, but I don't know that we need such a complex implementation. Overall, inlining seems to save space on average, so we may not need such complex code.
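A simplified sketch of such an encoder decision, assuming a 24-bit inline cutoff as mentioned above (the actual rule in the prototype may differ):

```python
# Hypothetical encoder decision: inline small values, index large ones.
# The 24-bit cutoff is taken from the discussion above; the real
# implementation may use a different threshold or extra heuristics
# (e.g. repetition counts) before falling back to the index table.

INLINE_BITS = 24

def zigzag(v):
    """Map signed ints to unsigned so small negatives stay small."""
    return (v << 1) ^ (v >> 63)

def should_inline(v):
    """Inline when the zigzagged value fits the inline payload; else index it."""
    return zigzag(v) < (1 << INLINE_BITS)

assert should_inline(1000) and should_inline(-1000)  # small either way
assert not should_inline(1 << 40)                    # too big: goes to the table
```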
Here are some random thoughts:
Merging the keys into the string_values field. This simplifies things slightly (and saves a few bytes for the second table header) but probably makes the index values for often-used keys larger (unless you take care to put all keys into the string table first). It also makes reuse of the keys table impossible in shaving or similar use cases. Creating this combined table is also more expensive in the first place, because a separate keys table has a small key space, which makes it easier to find the index of a given key. On the other hand, with nested maps a single table is a bit easier, because there is no question where all the strings are.

The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there.
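That mantissa distribution can be checked directly with `math.frexp`, which splits a double into a mantissa in [0.5, 1) and a power-of-two exponent:

```python
import math

# math.frexp(x) returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1,
# i.e. exactly the mantissa interval discussed above.
for x in (3.14159, 4.559190, 0.001, 12345.678):
    m, e = math.frexp(x)
    assert 0.5 <= abs(m) < 1.0   # mantissa is normalized into [0.5, 1)
    assert m * 2 ** e == x       # the split is exact; it just regroups the bits

# "Everyday" magnitudes use only a narrow band of exponents, which is the
# one place a shorter encoding might plausibly help.
assert math.frexp(8.0) == (0.5, 4)
```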
Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so.
Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned, it will lead to smaller index numbers, which, especially in the keys case, is probably worth it. Here is another idea, though: Currently all keys/string_values are directly in the layer object. If we push this down one level and have an intermediate string_table object, it could be more efficient. It would allow us to jump over the whole table or copy the whole table in one go. The cost is one more byte for the type and one varint for the length of the whole table; double that if we have separate keys/string_values tables.
Regarding integer value encodings: If we put large integers that can't be inlined as separate varints in the properties array, we will hit a bad case for varints. Because they are always large, chances are they will get even larger as varints (up to 10 bytes compared to 8 bytes for the int itself). So there is some inefficiency there. On the other hand, if we want to put them in an index table, we can use a fixed-size type instead of a varint, which avoids this and also makes access more efficient, because we can directly address values in those tables without having to decode them first. So I think if we keep the tables, they should be of type (s)fixed32/64 instead of (s/u)int32/64. It would still be an indirect access, which is likely slower than inlining, though.
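The worst case is straightforward to verify: a varint carries 7 payload bits per byte, so a full 64-bit value needs ceil(64/7) = 10 bytes, versus a flat 8 for fixed64. A small sketch:

```python
def uvarint(n):
    """Encode an unsigned int as a protobuf-style varint (little-endian base 128)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

big = 2**64 - 1
assert len(uvarint(big)) == 10                   # varint worst case
assert len(big.to_bytes(8, "little")) == 8       # fixed64 is always 8 bytes
assert uvarint(300) == b"\xac\x02"               # small values still win as varints
```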
We either need separate signed and unsigned integer types, or we need to zigzag unsigned integers as well, since non-zigzag negative numbers take so much extra space.
This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselves for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib.
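For reference, zigzag maps signed values of small magnitude (of either sign) to small unsigned ones, which is what keeps their varints short; a 64-bit version:

```python
def zigzag64(v):
    """Signed -> unsigned: 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ..."""
    return ((v << 1) ^ (v >> 63)) & 0xFFFFFFFFFFFFFFFF

def unzigzag64(u):
    """Inverse mapping, back to a signed int."""
    return (u >> 1) ^ -(u & 1)

assert [zigzag64(v) for v in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
for v in (0, 1, -1, 127, -128, 2**62, -2**62):
    assert unzigzag64(zigzag64(v)) == v   # round-trips exactly
```

Without zigzag, a plain two's-complement -1 encodes as the maximum 10-byte varint, which is the "extra space" problem mentioned above.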
Glad to hear that inlining floats sounds like it is off the table.
I'm fine with putting keys and strings in separate tables, and within a container object if that is considered useful. I'll try changing my prototype to do that. Should we put all the attributes inside that message, or is there a case where readers just want to skip/copy the strings?
Good point that the integers-by-reference should be fixed-size instead of varint, since they will always be large if they didn't get inlined. I'll change my prototype and this .proto file to do that.
If we talk about inlining signed integers at all, we are inherently doing bit-packing, so we have to explicitly talk about either zigzagging or sign extension, and zigzagging is probably the better choice of the two. All clients, no matter what language they are written in, will have to be able to unpack inlined integers.
On a different topic:
Closing in favor of #123
This is different from the simpler approach to adding lists and maps in #117. This is a complete overhaul of the way properties are encoded. The thought process behind this change is to allow for higher levels of compression of values by:
This also allows for null values.
Solves #75 and #62