protomaps / PMTiles

Cloud-optimized + compressed single-file tile archives for vector and raster maps
https://protomaps.com/docs/pmtiles/
BSD 3-Clause "New" or "Revised" License

Using python Reader object with cloud storage #38

Closed. fscottfoti closed this issue 2 years ago.

fscottfoti commented 2 years ago

It seems like go-pmtiles is set up to read from cloud storage (S3, Google Storage etc), but the python version here is not. Is that correct? How hard would it be to add that? I use Google, but an S3 example would be great.

If the Reader API is robust, I can add it into a flask/fastapi endpoint pretty trivially, implement my own authentication etc, but it definitely seems like being able to read the tiles from a file in Storage (rather than locally) is the really powerful use case here.
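The wrapping described above can be sketched as a minimal Flask endpoint. This is illustrative only: `TileReader` is a stand-in for whatever reader object ends up in the library, not the actual pmtiles API, and the tile data here is a dummy byte string.

```python
# Hypothetical sketch of the endpoint described above: a Flask route that
# serves tiles through a reader object. TileReader is a stand-in, not the
# real pmtiles interface.
from flask import Flask, abort

app = Flask(__name__)

class TileReader:
    """Stand-in for a pmtiles reader backed by cloud storage."""

    def __init__(self, tiles):
        self._tiles = tiles  # dict of (z, x, y) -> raw tile bytes

    def get(self, z, x, y):
        return self._tiles.get((z, x, y))

reader = TileReader({(0, 0, 0): b"\x1a\x00"})  # one dummy tile

@app.route("/tiles/<int:z>/<int:x>/<int:y>.mvt")
def tile(z, x, y):
    data = reader.get(z, x, y)
    if data is None:
        abort(404)
    return data, 200, {"Content-Type": "application/vnd.mapbox-vector-tile"}
```

Authentication could then be layered on with an ordinary Flask `before_request` hook, independent of the reader itself.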

bdon commented 2 years ago

Yep, we should refactor the current python reader to take an object that implements the transport part. Implementations that come to mind:

Open questions:

fscottfoti commented 2 years ago

I would think async/await would not be necessary - is there an advantage to making multiple I/O requests at the same time, e.g. for the metadata and the file? Seems like one has to wait on the other.

fscottfoti commented 2 years ago

It seems like all you need to do is pass in a get_bytes function and it would support anywhere the user gets the data. Do you need anything but the one function? You could keep passing in a file as an option and build a default get_bytes function for that case that uses mmap. The reader probably needs a test first for that kind of refactoring though.
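The `get_bytes` idea above can be sketched in a few lines: the reader only needs a callable with the signature `(offset, length) -> bytes`, and the local-file default can be built on mmap. Names here are illustrative, not the actual pmtiles interface.

```python
# Sketch of the get_bytes abstraction: the reader takes any callable
# (offset, length) -> bytes, with a local-file default built on mmap.
# Illustrative names only; the real pmtiles interface may differ.
import mmap

def mmap_get_bytes(path):
    """Return a get_bytes function backed by a memory-mapped local file."""
    f = open(path, "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def get_bytes(offset, length):
        return m[offset:offset + length]

    return get_bytes

class Reader:
    """Transport-agnostic reader: works for files, S3, GCS, HTTP..."""

    def __init__(self, get_bytes):
        self.get_bytes = get_bytes

    def header(self):
        # e.g. read a fixed-size header from the start of the archive
        return self.get_bytes(0, 512)
```

A cloud-storage variant then just supplies a different callable with the same signature, without the reader knowing the difference.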

fscottfoti commented 2 years ago

This PR implements a get_bytes function as one way to do this. I haven't tested it with Google Storage yet, but it seems to work with the local file. I also have another PR I could open which reformats the python code with black (like prettier in JS) here. Anyway it's just a draft for now - it would need some tests before merging.

fscottfoti commented 2 years ago

OK, I got it working with GCS. The only hard part was that I had to refactor load_directory so there weren't so many separate byte requests to the server, but that ended up not being a big deal either, and then it "just worked." Very cool idea - it takes about 0.2 seconds to fetch a tile from GCS, and it's probably even faster when called from an app deployed in the same data center.

bdon commented 2 years ago

> I would think async/await would not be necessary - is there an advantage to making multiple I/O requests at the same time, e.g. for the metadata and the file? Seems like one has to wait on the other.

Yeah, the requests for root directory -> possible leaf directory -> tile are all in serial. The async behavior is better in the case where you have a multithreaded server and want to cache and share results from different incoming clients, but I don't think most python web servers are well suited to take advantage of this.

> It seems like all you need to do is pass in a get_bytes function and it would support anywhere the user gets the data. Do you need anything but the one function?

I think a single-function interface is fine; it should support, in some way, a persistent connection option to enable features like HTTP keepalive, cookies, or holding a reference to an authenticated S3 client.
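One way to satisfy the persistent-connection point is to bind `get_bytes` to an object that owns a long-lived session, so keepalive, cookies, or auth state are reused across range requests. `requests.Session` is shown here as an illustration; an S3 or GCS client could be held the same way. The class name is hypothetical.

```python
# Sketch: bind get_bytes to an object holding a long-lived HTTP session,
# so connection reuse (keepalive), cookies, or auth state persist across
# range requests. HttpRangeSource is an illustrative name, not a real API.
import requests

class HttpRangeSource:
    def __init__(self, url, session=None):
        self.url = url
        self.session = session or requests.Session()  # reused across calls

    def get_bytes(self, offset, length):
        # HTTP Range headers use inclusive start/end byte offsets
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        resp = self.session.get(self.url, headers=headers)
        resp.raise_for_status()
        return resp.content
```

The reader would then be constructed with the bound method, e.g. `Reader(HttpRangeSource(url).get_bytes)`, keeping the library itself free of HTTP dependencies.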

Can you submit your PR for the reformatting first? I'll add some tests for the Python reader, as of now it's been mostly for debugging but ideally it should be flexible enough for your use case on GCS.

bdon commented 2 years ago

Looked at your change sets and I think this looks good overall. I would propose something like this (pseudocode, not tested):

# initialize the mmap reader
with pmtiles.Mmap('my_file') as get_bytes:
    reader = pmtiles.Reader(get_bytes)

# this would be in your program
# goal is for the PMTiles library to not have any external dependencies
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_path)

def get_gcp_bytes(offset, length):
    # download_as_bytes takes inclusive start/end byte offsets
    return blob.download_as_bytes(start=offset, end=offset + length - 1)

reader = pmtiles.Reader(get_gcp_bytes)

bdon commented 2 years ago

Looked at this a bit more; one thing to watch out for is that eventually the interface may need to be a bit more complicated to support #24, as how this is exposed may vary between different storage clients.

The use case here is having directories cached in the client and then the file on storage being modified. Without ETags, the client would start returning bad offsets into the new data.

(edit: relevant Google-specific issue https://github.com/googleapis/python-storage/issues/451 )

fscottfoti commented 2 years ago

Yes, that all makes sense.

It seems like all the Reader object needs is a get_bytes function, which can then be wrapped for each kind of reader. Each kind of reader (file, S3, GCS, etc.) needs its own file with its own set of dependencies (as you were pointing out). The user probably never calls the Reader.get method directly, but instead calls the one on the GCS object, which can check whether the underlying file has been updated in storage before completing the call.
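The staleness check described above can be sketched by having the backend return `(bytes, etag)` and dropping cached directories whenever the tag changes. All names here are illustrative; the real interface may end up different.

```python
# Sketch of the ETag concern: the backend returns (data, etag), and the
# wrapper invalidates cached directories whenever the archive changes
# underneath it. Illustrative names only.
class EtagCachingReader:
    def __init__(self, get_bytes_with_etag):
        self._get = get_bytes_with_etag
        self._etag = None
        self._dir_cache = {}  # cached directory entries keyed by offset

    def get_bytes(self, offset, length):
        data, etag = self._get(offset, length)
        if etag != self._etag:
            # file was replaced in storage; cached offsets are now stale
            self._dir_cache.clear()
            self._etag = etag
        return data
```

With a GCS blob, the tag could come from the blob's generation number; with S3, from the object's ETag header.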

As for the async part, I mean it seems like it would give you a small advantage for multiple requests the very first time you call it, but after that, the directory information will be cached (at least as I imagined it), so I still don't think async is very helpful.

But I think there are alternatives - at first I just wanted to see if it worked and if it was easy to do, and it does and it was.

bdon commented 2 years ago

@fscottfoti WIP on this here: https://github.com/protomaps/PMTiles/pull/42

Only MmapSource is implemented right now, but I think this is the simplest way to support get_bytes: just pass a function that might be bound to an object; that object can be an S3 client, an HTTP session, etc. May need to change the get_bytes return value to be (bytes, etag) in the future.

bdon commented 2 years ago

The latest Python release has better I/O batching and supports get_bytes. An example of its use is the new AWS Lambda implementation that exists in the repo here: https://github.com/protomaps/PMTiles/blob/master/serverless/aws/lambda_function.py#L74

This should be adaptable to Google Cloud with only a few changes.

fscottfoti commented 2 years ago

Awesome, thank you! Will take a look when I get a chance.