mapbox / rio-mbtiles

A plugin command for the Rasterio CLI that exports a raster dataset to an MBTiles 1.1 SQLite file
MIT License
125 stars 36 forks source link

Warped vrt [work in progress] #28

Closed sgillies closed 7 years ago

sgillies commented 7 years ago

This code uses Rasterio's new WarpedVRT class, creating one for the entire process, reads 256 x 256 windows from them, writes the arrays to MemoryFile instances of the appropriate format, then copies the MemoryFile bytes to a SQLite database.

sgillies commented 7 years ago

WarpedVRT is slower by a factor of 2 for a local file.

rio-mbtiles 1.3.0 (with multiprocessing pool):

(old-mbtiles) MapBox-FC:blog sean$ time rio mbtiles ~/code/rasterio/tests/data/RGB.byte.tif --zoom-levels 6..10 -o rgb.mbtiles -j 4 -f PNG
ERROR:rasterio._gdal:CPLE_OpenFailed in No such file or directory
ERROR:rasterio._gdal:CPLE_OpenFailed in No such file or directory
ERROR:rasterio._gdal:CPLE_OpenFailed in No such file or directory
ERROR:rasterio._gdal:CPLE_OpenFailed in No such file or directory

real    0m1.407s
user    0m3.455s
sys 0m0.229s

this PR with a WarpedVRT (w/ multiple threads) but mostly serial:

(rio-blog-post) MapBox-FC:blog sean$ time rio mbtiles ~/code/rasterio/tests/data/RGB.byte.tif --zoom-levels 6..10 -o rgb.mbtiles -j 4 -f PNG
/Users/sean/envs/rio-blog-post/lib/python3.6/site-packages/rasterio/windows.py:562: RasterioDeprecationWarning: use 'width' attribute instead
  RasterioDeprecationWarning)
/Users/sean/code/rio-mbtiles/mbtiles/scripts/cli.py:210: RasterioDeprecationWarning: use 'height' attribute instead
  boundless=True)
/Users/sean/envs/rio-blog-post/lib/python3.6/site-packages/rasterio/io.py:120: NotGeoreferencedWarning: Dataset has no geotransform set. Default transform will be applied (Affine.identity())
  nodata=nodata, **kwargs)

real    0m3.208s
user    0m3.085s
sys 0m0.103s

Half as fast because it's only running on one of my mac's cores.

Some notes about why it's complicated to "just" multi-process this below.

sgillies commented 7 years ago

@jdmcbr would you be willing to give this a review?

I've been curious about a WarpedVRT approach. It is a little easier than the concurrent method in master, but not as fast. GDAL's warper has some parallelization but it doesn't run on both cores of my mac like the code in master.

The concurrent method in master has a weakness, however: you can't use it to access datasets via HTTPS unless you pin the number of workers to 1. I saw errors about invalid credential chains when trying accessing an s3:// dataset from multiple processes. I'm not enough of an expert on OpenSSL or TLS to understand what the issue is.

Accessing s3:// datasets from multiple Python threads has no such problem. However, we're not able to access our source dataset or WarpedVRT from different threads because GDAL isn't thread-safe in that way. I haven't yet begun to look into whether there are speedups to be had by initializing worker threads with their own src and vrt.

I'm kind of inclined to merge this and make the WarpedVRT the reference implementation of rio-mbtiling, leaving versions with more concurrency as homework for developers.

sgillies commented 7 years ago

@jdmcbr thanks for the comments. If the slowdown is too much for you – and it is for me, the more I consider it – I think I may let this branch rest unmerged.

There is trouble accessing https: datasets from process pool workers, but rio-mbtiles is about creating tiles that entirely cover a dataset: you really do need to download the entire dataset to create the mbtiles and so we might as well do that up front.

jdmcbr commented 7 years ago

@sgillies Sounds good to me. On my question about the motivation for the PR, I didn't mean to make you repeat your answer above on https access. A maybe better way of asking my question would have been to say: what benefits does having multiple threads in https dataset access provide that outweigh the loss of multiple workers for tile processing? I thought there might be interactions with other services, or some other future developments, that would push for going the VRT route. Possibly just due to my own bias, with processing speed on files on a local filesystem as the most important factor for me, I thought there must be something else I was missing.

sgillies commented 7 years ago

I wanted to be sure that we weren't missing out on any obvious VRT performance wins and the inability to make HTTPS connections with the processing pool looked bad at first. Now that I've turned my back on tiling S3/HTTPS datasets, it might be worth looking into whether a VRT per process pool worker could get us any speed ups.