protomaps / PMTiles

Cloud-optimized + compressed single-file tile archives for vector and raster maps
https://protomaps.com/docs/pmtiles/
BSD 3-Clause "New" or "Revised" License

Split a pmtiles file #25

Closed msbarry closed 2 years ago

msbarry commented 2 years ago

Some hosts (like GitHub Pages) have maximum file sizes. Alternatives like https://github.com/phiresky/sql.js-httpvfs provide a way to split the file into pieces that each fit under that maximum (https://github.com/phiresky/sql.js-httpvfs/blob/master/create_db.sh). Would it be possible for the pmtiles reader and writer to optionally support splitting a pmtiles file?
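To make the request concrete, here is a minimal sketch of the lookup a split-archive reader would need: mapping a byte range in the logical archive onto fixed-size chunk files. This is not a PMTiles feature; the function name `split_range` and the idea of equal-size, sequentially numbered chunks are assumptions for illustration only.

```python
from typing import List, Tuple


def split_range(offset: int, length: int, chunk_size: int) -> List[Tuple[int, int, int]]:
    """Map a byte range of the logical archive onto fixed-size chunk files.

    Returns (chunk_index, first_byte, last_byte) triples. Byte positions are
    inclusive and relative to the start of each chunk, matching the form used
    by an HTTP Range header ("bytes=first-last").
    """
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")

    pieces = []
    position = offset
    end = offset + length  # exclusive end of the requested range
    while position < end:
        chunk_index = position // chunk_size
        start_in_chunk = position % chunk_size
        # Read up to the end of this chunk or the end of the request,
        # whichever comes first.
        take = min(chunk_size - start_in_chunk, end - position)
        pieces.append((chunk_index, start_in_chunk, start_in_chunk + take - 1))
        position += take
    return pieces
```

A read that crosses a chunk boundary turns into one range request per chunk touched, so a reader built this way would have to issue and reassemble several requests for some tiles.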

bdon commented 2 years ago

Thought about this for a bit and here's what I think are the benefits/drawbacks:

GitHub Pages seems to be meant for versioned code/docs and some associated assets, so I don't think it's a great fit as a primary target for tile archive hosting, though being free and fast is nice, and you can accomplish the same thing by expanding to directories/archives. Are there other examples out there where we need to split archives to a max piece size? 32-bit systems might be one, but I'd rather not consider that in scope.

msbarry commented 2 years ago

Agree that since the goal of pmtiles is to combine many files into one, it may not make sense to split them back out again... According to https://github.com/phiresky/sql.js-httpvfs, the benefits they see in splitting a large file that the client makes byte range requests to are:

> This is needed if your hoster has a maximum file size. It can also be a good idea generally depending on your CDN since it allows selective CDN caching of the chunks your users actually use and reduces cache eviction.

Also, when using something like S3, is it possible to allow only range requests? One concern with hosting a tileset in S3 is that a client might send a request without a Range header and accidentally start downloading the whole thing, which could run up bandwidth costs quickly. A split archive would partially mitigate that concern, but maybe it's not really an issue in practice?

bdon commented 2 years ago

It's not possible on raw S3 to allow only range requests. That concern is somewhat mitigated by having clients implement a rudimentary check as shown on this line: https://github.com/protomaps/PMTiles/blob/master/js/index.src.mjs#L71

In practice, it can be an issue, but it's not unique to PMTiles; other cloud-optimized formats like COG have the same drawback. The best solution for now is to run a proxy in front of your bucket, such as https://github.com/protomaps/go-pmtiles, but of course that's no longer just S3 :)
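As a rough illustration of the kind of rudimentary client-side check mentioned above, the sketch below inspects the status code and `Content-Length` of a response before reading its body. It is not the code at the linked line; the function name `check_range_response` and its parameters are hypothetical.

```python
from typing import Optional


def check_range_response(status: int, content_length: Optional[int], requested_length: int) -> None:
    """Raise if a response to a Range request looks like a full-file download.

    status           -- HTTP status code of the response
    content_length   -- value of the Content-Length header, or None if absent
    requested_length -- number of bytes asked for in the Range header
    """
    # 206 Partial Content is what a server that honors Range returns.
    # A plain 200 usually means the Range header was ignored.
    if status == 200:
        raise RuntimeError("server ignored the Range header and is sending the full file")
    if status != 206:
        raise RuntimeError(f"unexpected HTTP status {status}")
    # Even with a 206, refuse a body larger than what was requested.
    if content_length is not None and content_length > requested_length:
        raise RuntimeError(
            f"response is {content_length} bytes but only {requested_length} were requested"
        )
```

Calling this before consuming the body lets the client abort the connection early, which is what limits the accidental download. It does nothing against a client that deliberately omits the header.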

msbarry commented 2 years ago

> It's not possible on raw S3 to allow only range requests. That concern is somewhat mitigated by having clients implement a rudimentary check as shown on this line: https://github.com/protomaps/PMTiles/blob/master/js/index.src.mjs#L71

OK thanks, that check helps prevent accidental full downloads, but there's still the issue of intentional full downloads, which could start to be an issue with a 100 GB full-planet tileset hosted on S3, since each full download would cost the owner $10 in egress fees.

I was thinking of using pmtiles for the planetiler demo site (~500 MB mbtiles file on GitHub Pages), but if splitting a pmtiles archive doesn't make sense, then I can stick with the current approach of extracting all of the tiles to individual files.
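The $10 figure works out to roughly $0.10 per GB of egress. A small sketch of that arithmetic follows; the default rate is simply the one implied by the numbers in this comment (an assumption, not a quoted price), and `egress_cost_usd` is a hypothetical helper.

```python
def egress_cost_usd(size_gb: float, downloads: int = 1, usd_per_gb: float = 0.10) -> float:
    """Estimate the egress cost of serving a file in full `downloads` times.

    usd_per_gb defaults to the rate implied above ($10 for 100 GB); actual
    pricing depends on the provider and tier.
    """
    if size_gb < 0 or downloads < 0 or usd_per_gb < 0:
        raise ValueError("all arguments must be non-negative")
    return size_gb * downloads * usd_per_gb


# One full download of a 100 GB planet archive.
planet_once = egress_cost_usd(100)

# The same archive fetched in full 50 times.
planet_fifty = egress_cost_usd(100, downloads=50)
```

Under the same assumed rate, a ~500 MB file costs a few cents per full download, which is why the concern only becomes serious at planet scale.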

bdon commented 2 years ago

> OK thanks, that check helps prevent accidental full downloads, but there's still the issue of intentional full downloads, which could start to be an issue with a 100 GB full-planet tileset hosted on S3, since each full download would cost the owner $10 in egress fees.

Yeah, I agree the intentional linking/leeching is a concern - the basemap downloads I offer at http://protomaps.com/downloads are limited to at most a hundred or so megabytes, and my stopgap solution for larger maps is proxy-based like above. I'm optimistic about the long-term solve here being market pressure downwards on bandwidth in the next few years, for example if/when Cloudflare R2 becomes available.

bdon commented 2 years ago

I'm going to close this issue about archive splitting for now; I think the ETag features enabled by a single file take precedence over working around max file size limits. For the planetiler demo site, I've spun up a demo tile server using https://github.com/protomaps/go-pmtiles on an unmetered bandwidth server:

https://bdon.github.io/planetiler-demo/ (endpoint http://free-tiles.protomaps.com/planetiler/{z}/{x}/{y}.pbf)

Open to suggestions on how to organize the URL structures or metadata, or access for hosting regular updates.
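For anyone wiring a client to the endpoint above, here is a minimal sketch of filling in the `{z}/{x}/{y}` template. It assumes the endpoint follows the usual Web Mercator XYZ tile scheme, which is not stated in this thread; `lonlat_to_tile` and `tile_url` are hypothetical helper names.

```python
import math
from typing import Tuple

# URL template copied from the comment above.
TEMPLATE = "http://free-tiles.protomaps.com/planetiler/{z}/{x}/{y}.pbf"


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> Tuple[int, int]:
    """Convert a WGS84 longitude/latitude to XYZ tile coordinates at `zoom`.

    Assumes the common Web Mercator scheme with the origin at the top left.
    """
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 or extreme latitudes stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_url(template: str, z: int, x: int, y: int) -> str:
    """Substitute tile coordinates into a {z}/{x}/{y} URL template."""
    return template.format(z=z, x=x, y=y)


# Tile containing longitude 0, latitude 0 at zoom 1.
x, y = lonlat_to_tile(0.0, 0.0, 1)
url = tile_url(TEMPLATE, 1, x, y)
```

Most map libraries accept the template string directly and do this substitution themselves, so the helpers here are mainly useful for debugging individual tile requests.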