protomaps / PMTiles

Cloud-optimized + compressed single-file tile archives for vector and raster maps
https://protomaps.com/docs/pmtiles/
BSD 3-Clause "New" or "Revised" License
1.98k stars 115 forks source link

"pmtiles merge" CLI command? #403

Open mgibbs189 opened 3 months ago

mgibbs189 commented 3 months ago

I have a 120GB raster dataset, split into 2600 geotiffs. Extracting a .pmtiles from it hasn't been easy 🙃

Would you consider adding a pmtiles merge CLI command?

See below for the different approaches I've tried, and how a pmtiles merge command could help.

Option 1: gdal_merge of the geotiffs, then gdal_translate to .mbtiles

[-] Single-threaded [-] painfully slow [-] gdal requires ludicrous amount of free space (7+ TB)

Option 2: Convert each geotiff to its own .mbtiles file, then merge the .mbtiles files together

[+] multi-threaded, minus the merging [-] tippecanoe's tile-join doesn't support rasters

Option 3: gdal2tiles -> directory of .pngs, then mb-util to generate merged .mbtiles, then .pmtiles

[+] png creation multi-threaded [-] creating millions of pngs is torture on the filesystem [-] slow

Option 4 (proposed): Convert each geotiff into its own .mbtiles file, then pmtiles convert for each, then pmtiles merge to stitch all the pmtiles together

[+] multi-threaded [+] fast(?)

bdon commented 3 months ago

Option 1: gdal requires ludicrous amount of free space (7+ TB)

Does this also happen if you create a .vrt using gdalbuildvrt instead of using gdal_merge?

Then gdal_translate to pmtiles if you're on 3.8+: https://gdal.org/drivers/vector/pmtiles.html

mgibbs189 commented 3 months ago

@bdon here's the best approach I've found thus far:

gdalbuildvrt -o worldcover_v1.vrt Extracted/*.tif
gdal_translate -of MBTiles -co RESAMPLING=NEAREST worldcover_v1.vrt worldcover.mbtiles
pmtiles convert worldcover.mbtiles worldcover.pmtiles --tmpdir=./tmp/

re: gdal's pmtiles driver (vectors only?):

This driver supports reading and writing PMTiles datasets containing vector tiles, encoded in the MapVector Tiles (MVT) format.

The issue is that gdal_translate is single-threaded. The above process takes 4 days for a 120GB dataset.

If I could process each .tif separately (.tif -> .mbtiles -> .pmtiles), then merge them all together, that could potentially save a TON of time. Your pmtiles CLI commands seem to be way more optimized than gdal.

bdon commented 3 months ago

The issue is that gdal_translate is single-threaded. The above process takes 4 days for a 120GB dataset.

Maybe https://github.com/mapbox/rio-mbtiles is a multithreaded option? I've been exploring a derivative of that that can generate PMTiles directly.

mgibbs189 commented 3 months ago

The rio-mbtiles command is suited for small to medium (~1 GB) raster sources.

It seems like every solution thus far has caveats.

rio-mbtiles doesn't help much here, because I could just run gdal_translate -of MBTiles individually on each tif file, then run pmtiles convert individually on each mbtiles file to get a bunch ofpmtiles files.

The problem is still the same... squashing all the pmtiles into one.

Let's assume the simplest use-case (merging pmtiles with no overlaps). What does that process look like to build out pmtiles merge? Is it technically complex? Could I maybe sponsor it?

bdon commented 3 months ago

Let's assume the simplest use-case (merging pmtiles with no overlaps)

In most cases "no overlaps" is not actually feasible. If you have 2600 GeoTIFFs, and each one is being tiled to zoom 0, then all the resulting .pmtiles will have data at z=0, meaning pmtiles merge is not enough, we will also need to do image operations to mosaic tiles. z=0 is an extreme example but it's pretty likely that there is some spatial overlap between lower zooms, unless your use case is specific enough that there's really no overlap.

bdon commented 3 months ago

^ My solution to the above is that any correct solution needs to start with a .vrt and GDAL handles the raster mosaicing prior to tiling. I suggested rio-mbtiles because that has parallelism, and if we identify the limits on raster sources we ought to be able to work around them. One solution may be to use gdaladdo on the VRT prior to tiling.

mgibbs189 commented 3 months ago

I'll give rio-mbtiles another try and will report back.

Thanks for the clarification re: pmtiles merge.

mgibbs189 commented 3 months ago

@bdon For the 120GB dataset, rio mbtiles is showing 900 hours to finish... on a monster machine with 30 cores processing it.

Again, I love PMTiles but this hasn't been a great experience 🙃

The fastest approach I've found thus far is to create the millions of tiles via gdal2tiles.py (bleh), then mb-util to clump them together, then finally pmtiles convert.

Really, really hoping that you could give us raster users some better tooling, whether it involves pmtiles merge, some custom rio extension, etc.

bdon commented 3 months ago

Again, I love PMTiles but this hasn't been a great experience

The problems you are encountering seem to be related to your specific challenging dataset and tiling it, regardless of tile archive format. Could you make your dataset and current tooling available so other can pitch in to help?

mgibbs189 commented 3 months ago

It's the ESA WorldCover dataset: https://worldcover2021.esa.int/

The best approach thus far:

  1. gdalbuildvrt to generate a .vrt from the 2600+ extracted .tifs
  2. gdal2tiles.py to create Tiles directory from the .vrt (millions of png tiles)
  3. mb-util to build an .mbtiles file from the Tiles directory
  4. pmtiles convert to input the .mbtiles file and output a .pmtiles file

Every other solution I've tried has either failed (see the first comment in this thread), required absurd resources, or would've taken many days (weeks) to complete.

@bdon maybe you could give it a try as a proof-of-concept 😇

larsmaxfield commented 1 month ago

Have you looked into using libvips to join the TIFFs and then generate a tileset? Something like this:

libvips works as a pipeline. It's very quick and does not require keeping images in memory when manipulating them. There's also a pyvips binding. I use that.

mgibbs189 commented 1 month ago

@larsmaxfield thanks for the suggestion. That's very similar to Option 3 above (I'm hoping to avoid the tileset generation step). There's got to be a better way around it, but the likelihood of a pmtiles merge command isn't looking promising 😬

bdon commented 1 month ago

If you want the ESA WorldCover as tiles you can modify this script https://github.com/OvertureMaps/overture-tiles/blob/main/profiles/Base.java to convert Overture GeoParquet's ESA Worldcover dataset into tiles.