Alternative dir structure for a more compact storage and proximity clustering

nyurik commented 2 years ago

I would like to propose some changes to the directory structure, but these might be totally irrelevant due to my misunderstanding.

Current directory entry is fixed at 17bytes, stores x,y as individual values, and requires x,y sorting order. I think all of those may benefit from a bit of restructuring:

Combine x and y values into a single interleaved value (z-curve). This will make tiles more locale-clustered, possibly reducing the number of range requests (web).
Make the combined x,y value size depend on the zoom level, rounding to the nearest byte boundary. zooms 0..3 -- 1 byte, 4..7 -- 2 bytes, etc. All leaf nodes are required to be the same zoom, making the directory entries the same size only inside a single directory block, rather than everywhere.
For the root dir, it could still be 17bytes (although it seems 6 bytes is a bit too much -- zoom 16..19 only requires 5 bytes, so unless pmtiles wants to support more... and even that can be made flex)

bdon commented 2 years ago

This is addressed in the upcoming v3 design:

https://github.com/protomaps/PMTiles/blob/spec-v3/spec/v3_design_doc.md

This goes one step further than combining x and y by combining z,x,y into a single value.
I think you are referring to varint-encoding since most values are small? This is also addressed in the v3 design as directory entries are varint-encoded when serialized to disk, although the fixed-width encoding is used for efficient access in memory.

bdon commented 2 years ago

see notes in https://github.com/protomaps/PMTiles/issues/41 as well.

experimentally - running https://github.com/protomaps/PMTiles/tree/spec-v3/spec/converter on the z14 planetiler output can get to an effective <2 bytes per tile entry, after RLE + delta + varint + gzip compression steps.

nyurik commented 2 years ago

Thanks @bdon , this is awesome!!! A few minor questions:

Would it make sense to default to Brotli rather than gzip by default? Moving forward, it seems most web serving will switch to Brotli soonish.
is there a specification for the TileID? The Hilbert curve usually covers 2D, and I am not certain how the 3D variant will look like
Is there a way to encode sequential TileIDs more efficiently, e.g. if you know ahead of time that a block represents sequential tiles and there is no real reason to have each individual tile ID encoded separately?
I love the RunLength idea - makes encoding of the identical items easier. How much space does it use when the run length is zero?

bdon commented 2 years ago

I'm leaning towards making the index and tile compression be configurable, with the compression type stored in header. Brotli does not have a breadth of implementations across different languages and environments like gzip; PMTiles clients will use https://github.com/101arrowz/fflate for decompression which works across browsers and Lambda.
The specification is just the 2D tile IDs arranged in sequence where Z0 is at TileID=0, Z1 starts at TileID 1, Z2 starts at TileID 5 and so on.
I think you mean a "dense" representation like https://github.com/openaddresses/TileBase (and I think https://github.com/mactrem/com-tiles/tree/main/packages/spec has some provisions for this as well).
- In general, I think adding sparse directories complicates the design because then you need to store whether a directory is sparse or dense and branch the logic based off of that. I think a sparse directory + delta encoding of TileIDs accomplishes the same thing after compression, because a complete block of sequential tiles will have "1" as the delta-encoded TileID which should be easy to compress out in the gzip step.
- The runlength=1 is the general case (runlength=0 is used to indicate a leaf pointer), and it will add an additional 4 bytes to every entry in memory (we could use 2 bytes, meaning the max run length is 65535, which is exceeded empirically in a planet-scale ocean at z14ish, so that would require multiple entries). But again, with varint+gzip compression this should be optimized out in the directory serialization step.

So in general, the current "v2" design punts on compactness optimizations by having a 1:1 match between directory representation at rest and in memory. Of the 4 compression steps I want to add: RLE, Varint, Delta, GZIP, the Varint+Delta parts should do a sufficient job of removing redundancy.

nyurik commented 2 years ago

Thanks, it does make sense. Have you had any trial runs comparing the space cost of sparse for a planet-file vs dense? You could also have a bit flag field to indicate the presence of fields, plus a new value with the number of dense entries to follow, just like with the run length. The entry would then have tileid, optional blocklength (N) (default=1), optional runlength (default=1), plus an array of N offsets+lengths. TBH, I don't think this will make the design significantly more complicated given how many decoding steps are already required. But again, this might be a premature optimization until you can run a space test.

Do you have a sample code of decoding tileID? Seems like it would have to be a weird one like

if id < 2^0:
  return 0,0,0
if id < 2^1:
  x, y = decode(id - 2^0)
  return 1, x, y
if id < 2^2:
  x, y = decode(id - 2^1)
  return 2, x, y
...

bdon commented 2 years ago

Yes, that decoding routine is roughly right, though you can use a for-loop and an accumulator:

https://github.com/protomaps/PMTiles/blob/spec-v3/spec/converter/tile_id.go#L71

I think the dense format would complicate the encoding routine, which is currently fairly simple for those 4 steps: https://github.com/protomaps/PMTiles/blob/spec-v3/spec/converter/convert.go#L53 and I have to implement across at least six different languages. Any dense run would be terminated by a RunLength > 1 since the TileIDs of two consecutive entries would not be contiguous. It's worth experimenting with and maybe including if the size savings are significant; the current design achieves a total compressed directory size of 80-90 MB for a z14 planet basemap dataset which I'm satisfied with.

bdon commented 2 years ago

Thinking a bit out loud, the way I would probably do this is to re-use RunLength for dense sequences: a positive RunLength N means that N (offset,length) pairs follow corresponding to sequential Tile IDs, a negative RunLength N means the next -N tile IDs are an identical offset/length.

nyurik commented 2 years ago

LOL, I was thinking along the same lines :))) RunLength should be moved up though, before the offset/len fields (just to keep it logical).

nyurik commented 2 years ago

WRT decoding tileid -- feels expensive tbh - the higher the zoom, the more steps are required to parse it. That said, tight loops without memory access are very fast, so might not be too bad.

bdon commented 2 years ago

@msbarry has benchmarked the decoding/encoding routine when the Hilbert code was still in here https://github.com/onthegomap/planetiler/pull/266/files#diff-9e784e32e5ed72b6cd67c89789e4b4f0fe3082810b9f839b357b0ccd11de58d4 and the decoding step was trivial relative to other things happening.

msbarry commented 2 years ago

We found a hilbert implementation fast enough that computing the start offsets was a bottleneck, the fastest way we found to do it was to precompute a lookup table, then iterate through it backwards to get zoom for an index, something like:

  private static final int[] ZOOM_START_INDEX = new int[16];

  static {
    int idx = 0;
    for (int z = 0; z <= 15; z++) {
      ZOOM_START_INDEX[z] = idx;
      int count = (1 << z) * (1 << z);
      idx += count;
    }
  }

  private static int startIndexForZoom(int z) {
    return ZOOM_START_INDEX[z];
  }

  private static int zoomForIndex(int idx) {
    for (int z = 15; z >= 0; z--) {
      if (ZOOM_START_INDEX[z] <= idx) {
        return z;
      }
    }
  }

nyurik commented 2 years ago

If the entire leaf is the same zoom, can we just keep the zoom separate?

bdon commented 2 years ago

There is no concept of zoom in leaves as that is implied by the TileId. A directory might cross zoom levels; this can be as many as eight zoom levels in the root. The TileId encodes (z,x,y) together so that a set of tiles at different zoom levels can be represented with only a unique set of integers.

nyurik commented 2 years ago

I was referring to All leaf directories in a single directory must have the same Z value. (new in v2). rule. Come to think of it, its not a big deal because 1) its delta encoded, and 2) decoder is unlikely to decode the value anyway - most often it will be a "encode and lookup", rather than iterate over existing. And even if it is decoding, it would cache the current zoom and just check the bounds. So yes, a single value, while not intuitive, might work just as well.

nyurik commented 2 years ago

P.S. once the v3 spec is stable, I could try to hack this in Rust unless someone already has implemented it (i haven't found it).

Also, it might make sense to get some feedback on v3 spec from those who worked on OSM pbf data and similar formats - i.e. @b-r-u @joto @pka @bjornharrtell ...

bdon commented 2 years ago

There isn't much overlap between the OSM PBF format and PMTiles, as PMTiles is restricted purely to Z/X/Y addressing and not geometry or geographic coordinates - for that you might be interested in https://github.com/protomaps/OSMExpress which is meant to mirror OSM PBF transactionally on disk.

It's worth attempting to benchmark what a change in the entry data structure would look like to encode non-duplicates as runlengths - this would make the underlying entry struct non-fixed-width, because it would have a nested array of Offsets/Lengths for each TileID. My hypothesis is that this could save 20-30% on the in-memory representation at the cost of making access and logic more complicated. I'm not sure if the 20-30% would translate to savings on disk after compression because the relevant columns (RunLength, TileID) are both delta-encoded and laid out in columns in the current draft implementation, so they should compress almost perfectly.

bdon commented 2 years ago

"Nested entries" for serialization in 5ec3a8188e2cfc5f9e5a23d88ac1c1f793ab457b

this counterintuitively increased the total compressed index size from ~92 MB for a planet z14 basemap tileset to ~101 MB. Double checking if I made any errors in the implementation...

nyurik commented 2 years ago

Can it be due to negative numbers brings stored as 8 bytes each, with all the high bits set?

bdon commented 2 years ago

No, because the signed varint encoding is zigzag-encoded, so they should still be small.

The root cause here is another optimization which zeroes out offsets if the tile data is contiguous, leaving only the length. The code as-is only does the zeroing within a Run, while the previous code worked across all Entries in a serialized directory. I've changed this to work across all offsets like before; the compressed size beats the non-nested very slightly on a small Tippecanoe dataset (1438 -> 1426 bytes), benchmarking this on planet now

bdon commented 2 years ago

Nested method is 91.3 megabytes total on test dataset vs 92.8 megabytes non-nested

nyurik commented 2 years ago

Thanks @bdon for doing the analysis! So 1.6+% savings but only applicable to the metadata. It does make online access slightly faster, but doesn't seem to be significant enough to make a difference, unless we can come up with some trick to optimize it even further.

If it is not too hard, could you try Brotli compression, just to get a ballpark estimate of how much we could potentially save in the future? Not a biggie though.

bdon commented 2 years ago

It shouldn't be hard, I will also try zstandard as well tonight, though I imagine the savings will be marginal for small files + high entropy numbers instead of strings

bdon commented 2 years ago

running brotli test in https://github.com/protomaps/PMTiles/tree/brotli

bdon commented 2 years ago

Note on the nested memory savings:

With the existing non-nested implementation, the complete planet has about 40,000,000 entries, where each entry is 8 + 4 + 8 + 4 (tileid,runlength,offset,length) bytes = ~960 MB in-memory, which seems reasonable to me, given that the purpose of this design as a whole is not needing to have the entire index in memory at once.

By shortening Runlength to 2 bytes (the vast majority are just zeroes) instead we can shave this down to ~880 MB, but that doesn't seem like a huge win.

bdon commented 2 years ago

Brotli reduces the index size from 92.8 megabytes to 73.6 megabytes with the compression quality set to 11 (slowest), which is nice savings; it won't be practical until most runtimes (python, node, browser via DecompressionStream web standard) support it though

bdon commented 2 years ago

zstd with the default compression ratio (3) produces a 91.2 MB index, so almost the same as gzip; the higher compression levels are not implemented in this pure-Go port, and support is generally behind Brotli

msbarry commented 2 years ago

Nice! <1GB for the in-memory representation is great - a tile server could easily hold that in memory and serve the raw tiles with a single lookup.

For planetiler, I looked into brotli support for java, it's really close but might need some pushing to make it generally usable. There is code for an official java wrapper around the native brotli library but it's not published to maven central. jvm-brotli was publishing it, but it looks abandoned now, and brotli4j might be a better alternative - but it doesn't support m1 macs yet so I can't use it... Here's a recent reddit discussion.

My 2 cents would be to stick with gzip for compatibility, and work towards brotli when there's broad enough support across browsers and languages...

bdon commented 2 years ago

here's the real-world z14 test dataset I'm using to benchmark compression:

https://protomaps-static.sfo3.digitaloceanspaces.com/dir.bin.gz

The gzipped file is 293 MB, the ungzipped file is 981 MB. The file is simply a sequence of TileId uint64, RunLength uint32, Offset uint64, Length uint32 in little-endian order: 24 bytes per entry: 981,227,232 bytes total size = 40,884,468 entries total.

So a naive gzipping of this dataset is 981/293 = 3.34x compression ratio.

The PMTiles compression technique as is results in a 92 MB unpartitioned directory, a ~10.6x compression ratio.

Analysis of how each compression step affects the result (removing one step at a time from the complete algorithm)

Delta encoding of TileID: 171 MB Varint encoding: 118 MB Columnar transformation: 102 MB Zeroing of contiguous offsets (only effective in a clustered archive): 182 MB General-purpose gzip compression: 243 MB

So in order of most to least effective of these 5 steps: gzip, offset zeroing, delta encoding, varint encoding, columnar

nyurik commented 2 years ago

I keep thinking that it might make sense to significantly reduce memory footprint and use u16 instead of 32. A tiny change to the encoding logic is worth 15% server memory reduction

bdon commented 2 years ago

@nyurik for most languages in practice I think it will not make a difference since the struct as a whole will need to be aligned to 8 bytes on 64 bit systems; 22 will get padded to 24 bytes. can you check this in rust?

kylebarron commented 2 years ago

By the way a quick Python script to convert the row-based data into a columnar Parquet format for testing file sizes:

import struct
from pathlib import Path

import click
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# From https://docs.python.org/3/library/struct.html#format-characters
STRUCT_FORMAT = "<QLQL"
ROW_BYTE_SIZE = struct.calcsize(STRUCT_FORMAT)

def main(path: Path = Path("/Users/kyle/Downloads/dir.bin")):
    with open(path, "rb") as f:
        buf = f.read()

    total_rows = int(981227232 / ROW_BYTE_SIZE)

    col0 = np.zeros(total_rows, dtype=np.uint64)
    col1 = np.zeros(total_rows, dtype=np.uint32)
    col2 = np.zeros(total_rows, dtype=np.uint64)
    col3 = np.zeros(total_rows, dtype=np.uint32)

    i = 0
    with click.progressbar(length=len(buf)) as bar:
        for row in struct.iter_unpack(STRUCT_FORMAT, buf):
            col0[i] = row[0]
            col1[i] = row[1]
            col2[i] = row[2]
            col3[i] = row[3]

            i += 1
            bar.update(ROW_BYTE_SIZE)

    names = ["TileId", "RunLength", "Offset", "Length"]
    table = pa.Table.from_arrays([col0, col1, col2, col3], names=names)
    pq.write_table(
        table,
        "dir.parquet",
        compression="zstd",
    )

if __name__ == "__main__":
    main()

Note that PyArrow doesn't support writing with delta compressions

nyurik commented 2 years ago

@nyurik for most languages in practice I think it will not make a difference since the struct as a whole will need to be aligned to 8 bytes on 64 bit systems; 22 will get padded to 24 bytes. can you check this in rust?

@bdon in Rust you have full control over the struct memory layout if you want to do that. So if I wanted to optimize for memory usage (which I probably would to cache the whole tile table), I would set up a packed struct. Same btw is available in C#.

bdon commented 2 years ago

@nyurik ok, good to know! To elaborate on what I said about memory usage above, the compressed directories are designed so you don't need to have the entire thing loaded in memory at once; most tile serving systems are going to have a very small "hot" set of frequently viewed tiles, and a typical client should LRU cache uncompressed directories (the logic for using up to a maximum byte amount of cache might be tricky...)

The use cases where you might want to have "all the data" are more niche and ephemeral, like on initial creation or re-encoding the entire directory set, and I don't feel like those situations are worth optimizing for.

Statistics on values of RunLength in test dataset:

3,111,310 of 40,884,468 entries have RunLength > 1 (7.6%)

# Number of samples = 3111310
# Min = 2
# Max = 6910615
#
# Mean = 102.89581655316492
# Standard deviation = 9206.430911998288
# Variance = 84758370.13739765
#
# Each ∎ is a count of 62224
#
      2 ..  691063 [ 3111243 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 691063 .. 1382124 [      45 ]:
1382124 .. 2073185 [      14 ]:
2073185 .. 2764246 [       2 ]:
2764246 .. 3455307 [       1 ]:
3455307 .. 4146368 [       1 ]:
4146368 .. 4837429 [       2 ]:
4837429 .. 5528490 [       1 ]:
5528490 .. 6219551 [       0 ]:
6219551 .. 6910612 [       1 ]:

Max RunLength is 6910612 which would require 106 sequential entries with max(uint16) RunLength

bdon commented 2 years ago

zoomed in histogram view on only entries with RunLength > max(uint16)

# Number of samples = 594
# Min = 65710
# Max = 6910615
#
# Mean = 339514.3888888893
# Standard deviation = 571100.6973181095
# Variance = 326156006477.23083
#
# Each ∎ is a count of 10
#
  65710 ..  750201 [ 535 ]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 750201 .. 1434692 [  40 ]: ∎∎∎∎
1434692 .. 2119183 [  11 ]: ∎
2119183 .. 2803674 [   3 ]:
2803674 .. 3488165 [   0 ]:
3488165 .. 4172656 [   1 ]:
4172656 .. 4857147 [   2 ]:
4857147 .. 5541638 [   1 ]:
5541638 .. 6226129 [   0 ]:
6226129 .. 6910620 [   1 ]:

so ~600 entries in total would need to be split into multiple entries at max 6910612, we are still far away from hitting max(uint32) by a factor of 600.

Real-world use cases are unlikely to hit max(uint32), example: the same dataset but tiles created down to zoom level 18

bdon commented 2 years ago

Deduplication stats:

There are 50,127 tile contents that are duplicated at least once in the dataset

offset= 32441304901 #entries= 1561559
offset= 16920670396 #entries= 438687
offset= 32439101503 #entries= 232993
offset= 7473610749 #entries= 114595
offset= 16919704241 #entries= 94957
offset= 7473156562 #entries= 37830
offset= 3650798046 #entries= 31656
offset= 3650651795 #entries= 15007
offset= 32593848682 #entries= 14545
offset= 1543614639 #entries= 7762
offset= 49930265511 #entries= 7433
offset= 32548414452 #entries= 6494
offset= 1543535435 #entries= 6043
offset= 17003600263 #entries= 5834
offset= 49921790276 #entries= 4924
offset= 49951084821 #entries= 4764
offset= 68658123689 #entries= 4658
offset= 61490012849 #entries= 4125
offset= 21505185621 #entries= 3714
offset= 56404313394 #entries= 3119
offset= 596899953 #entries= 2413
offset= 21500954762 #entries= 2409

nyurik commented 2 years ago

I just realized that if all tiles are gzip-compressed, each one of them will be wasting at least 2 bytes -- 1f 8b, and possibly more -- up to the whole 10 byte header. From wiki, each gzip starts with:

a 10-byte header, containing a magic number (1f 8b), the compression method (08 for DEFLATE), 1-byte of header flags, a 4-byte timestamp, compression flags and the operating system ID.

optional extra headers as allowed by the header flags, including the original filename, a comment field, an "extra" field, and the lower half of a CRC-32 checksum for the header section.[3]

Note that the full CRC is stored in the 8-byte footer. I think we can significantly optimize the size of the pmtiles by removing these dups, but of course this would mean that the pmtiles header will have to contain enough metadata to decode all payload variants, e.g. by containing the 10-byte header, etc.

bdon commented 2 years ago

The intention is for each tile to be a valid Gzip payload at-rest; removing those redundant fields would require a server to reconstruct a valid payload. So I believe we have to live with the overhead.

nyurik commented 2 years ago

If the client will treat each range read as independent, than sure, but we will want range reads that include multiple tiles, even if some tiles will be discarded - so the client will parse the range read to break it into multiple tiles, it might as well concatenate the same prefix to all.

At the end of the day, its all about the cost. My back of the napkin calculation: z0..14 is 268,435,456 tiles. The number of dup tiles from your comment is 3,111,310 (is this correct? seems low). So that means 265,324,146 unique blobs, or ~0.5GB of compressed(!) diff just for the first 2 bytes, or 2,5GB if we manage to remove the whole 10 byte header. Seems to be a fairly significant chunk to be worth the efforts I think?

bdon commented 2 years ago

The number of unique blobs is more like 40,000,000 for a planet-scale basemap.

I think what you are describing is to store tiles as raw Deflate streams instead of gzip (gzip is just a framing around deflate) but it is unknown to me how decompressor implementations across multiple runtimes handle raw streams. We want any decompressor to be able to just do something like

tile_data = decompressor.inflate(compressed_data)

without having to fiddle around with adding headers or footers based on the stated compression type, etc.

10 bytes * 40m = 400 megabytes of overhead from the gzip headers, or about ~0.5% of storage overhead.

bdon commented 2 years ago

related discussion: https://stackoverflow.com/questions/1574168/deflate-compression-browser-compatibility-and-advantages-over-gzip although not 100% relevant because we control the decompressor via a bundled fflate library.

In the long term we want to move to the browser-native DecompressionStream API for performance

bdon commented 2 years ago

Closing this one out as it's more or less settled here: https://github.com/protomaps/PMTiles/blob/master/spec/v3/spec.md

Compression is specified by an enum in the header. Technically, you could mix Gzip and uncompressed tiles and save some space and indicate Compression=Unknown in the header. This makes clients that work with ReadableStreams a bit complicated, however.

protomaps / PMTiles

Alternative dir structure for a more compact storage and proximity clustering #62