versatiles-org / versatiles-rs

VersaTiles - A toolbox for converting, checking and serving map tiles in various formats.
https://versatiles.org
MIT License

support Google Cloud Storage (gs://) as data source #22

Closed MichaelKreil closed 6 months ago

hinricht commented 11 months ago

How much work would that actually be? It would be handy for us, but HTTPS also works, so no hurry.

MichaelKreil commented 11 months ago

There are libraries for connecting to Google Cloud Storage. But the challenge is how to authenticate, especially in Google Cloud Run.

MichaelKreil commented 11 months ago

As a workaround you can use Cloud Storage FUSE with Cloud Run to mount the bucket as volume: https://cloud.google.com/run/docs/tutorials/network-filesystems-fuse

I tested it. Basically it works, but versatiles returns an error without any debug information when reading the latest planet while running in Cloud Run. Other files and other platforms do work; I have no idea why. Maybe accessing a 50 GB file via GCS FUSE on a tiny Cloud Run node has side effects because of caching/temporary files? Maybe I should test reading byte ranges using dd.
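A byte-range read with dd could be sketched like this (the mount path is hypothetical; here a local test file stands in for the FUSE-mounted planet file):

```shell
# Create a 1 MiB stand-in for the mounted file (the real path under a
# GCS FUSE mount would be something like /mnt/gcs/planet-latest.versatiles).
dd if=/dev/zero of=/tmp/testfile bs=1024 count=1024 status=none
printf 'HELLO' | dd of=/tmp/testfile bs=1 seek=524288 conv=notrunc status=none

# Read a 5-byte range starting at offset 524288, similar to how versatiles
# fetches a tile's byte range from the container.
dd if=/tmp/testfile bs=1 skip=524288 count=5 status=none
```

If this works on the mounted bucket but versatiles still fails, the problem is probably not the FUSE layer itself.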

But I also learned a little bit about how authentication works. GCS FUSE uses Application Default Credentials. Alternatively you can use the Google internal metadata server to fetch an ID token. The Rust library google-cloud-storage also uses ADC/metadata server, but adds a lot of unnecessary complexity, because it covers the whole Storage API. I don't want to include a full library with many dependencies just to use one function for a tiny feature.
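For illustration, a minimal sketch of the metadata-server approach (the function name is made up, not from the versatiles codebase): inside GCE/Cloud Run, an access token is available via a plain GET to the metadata server with the `Metadata-Flavor: Google` header. This only builds the raw HTTP/1.1 request; a real implementation would send it over TCP and parse the JSON reply.

```rust
// Sketch: build the metadata-server token request. On Cloud Run / GCE,
// GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token
// with the header "Metadata-Flavor: Google" returns a JSON access token.
fn metadata_token_request() -> String {
    let host = "metadata.google.internal";
    let path = "/computeMetadata/v1/instance/service-accounts/default/token";
    format!(
        "GET {path} HTTP/1.1\r\nHost: {host}\r\nMetadata-Flavor: Google\r\nConnection: close\r\n\r\n"
    )
}

fn main() {
    print!("{}", metadata_token_request());
}
```

That would keep the dependency footprint at roughly one small HTTP client plus a JSON parser, instead of the whole Storage API surface.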

All we need is a simple way of accessing the data using byte range requests. We could do that directly via https://storage.googleapis.com/$BUCKET/$PATH/planet-latest.versatiles, but we need some kind of authorisation. Maybe we can implement that ourselves using ADC and the metadata server. Actually, I believe it's not that complicated. Maybe just an additional request for getting a token that can be used as GET parameter ...
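Such a request could look like this (a sketch with placeholder bucket/object/token values; the helper is hypothetical): an authorised byte-range GET against the public object endpoint, passing the token as a Bearer header rather than a GET parameter.

```rust
// Sketch: build an authorised byte-range request against
// https://storage.googleapis.com/$BUCKET/$OBJECT. The token would come
// from ADC or the metadata server; here it is just a placeholder string.
fn range_request(bucket: &str, object: &str, token: &str, start: u64, end: u64) -> String {
    format!(
        "GET /{bucket}/{object} HTTP/1.1\r\n\
         Host: storage.googleapis.com\r\n\
         Authorization: Bearer {token}\r\n\
         Range: bytes={start}-{end}\r\n\
         Connection: close\r\n\r\n"
    )
}

fn main() {
    print!("{}", range_request("my-bucket", "planet-latest.versatiles", "TOKEN", 0, 1023));
}
```

The server should answer with 206 Partial Content, which is exactly the access pattern versatiles already uses for HTTP sources.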

hinricht commented 11 months ago

> combination with running in Cloud Run. Other files / other platforms do work. I've no idea why. Maybe accessing a 50GB file with GCS FUSE in a tiny GCR node has some side effects because of caching/temporary files? Maybe I should test reading byte ranges using dd.

Doesn't sound like the way to go. Another project here had performance issues with the FUSE driver, so native access to the bucket would be better in this regard as well.

> The Rust library google-cloud-storage also uses ADC/metadata server, but adds a lot of unneccessary complexity, because it covers the whole Storage-API. I don't want to include a full library with many dependencies for using just one function for a tiny feature.

I know adding the full google-cloud-storage Rust lib might blow up the image, but I'd prefer to go this way, because then we don't need to worry about it any more. A self-baked solution might be much slimmer, but it adds a different kind of complexity: we'd need to maintain it in the future, across all possible API changes. Would a dedicated image which includes the Google lib be an option, to keep the footprint small for the main versatiles image?