stadiamaps / pmtiles-rs

Rust implementation of PMTiles
Apache License 2.0
54 stars 8 forks source link

Handle backend data changes #40

Open nyurik opened 4 months ago

nyurik commented 4 months ago

In case of an HTTP or S3 bucket backends, the pmtiles file could get modified while being used, and should be handled properly.

In order to make this possible, the backend API would need to change. Currently, we have read and read_exact. Most of the time, read_exact is used. The read is actually only used once - to read the initial block (main directory). I propose we convert

// current backend func
pub fn read(&self, offset: usize, length: usize) -> PmtResult<Bytes> { ... }
// proposed backend func
pub fn new(length: usize, /* any params to init new backend */) -> PmtResult<(Self, Bytes)> { ... }

This way each backend can self initialize and read first block of data, and while doing it, may also set some monitors or get etag data - returning both the new backend instance and the first block of data. We may want to make all backend creation non-public, but this is not certain yet.

The next step is how to handle the actual data change. The actual backend will return some new PmtError::UnderlyingDataModified (naming?), but we need a new logic for the fn get_tile(...). It could simply pass through that error to the caller, and let the caller handle it (e.g. caller may close and re-open the backend, invalidating the cache as well). Or we could do that inside get_tile, but the logic might get tricky - for example backend requests could start hitting two load-balanced endpoints that just happened to have different versions of the same file, thus producing different etag values.

/// Get a tile, and if any of the backend requests return unexpected etag, return an error
fn get_tile(&self, ...) {
  // recursively navigate through directories (some might be cached) to find the right entry
  self.find_tile_entry(...)
  // get data for that entry
  self.backend.read_exact(...)
}

One possible solution is to introduce a new get_tile_with_retry(&mut self, ...) -- allowing self to be modified, but I don't think its a workable solution -- most of the users would want to parallelize this call, so &mut self is not possible. Instead, we may need to introduce a thicker wrapper that stores backend inside an Arc<RwLock<Backend>> - in which case we could still use get_tile(&self,...), but inside it will get a read lock, get the tile, and re-create the backend if it is outdated.

fn get_tile(&mut self) {
  let mut retries = 0;
  loop {
    // this is an Arc<RwLock<AsyncPmTilesReader<Backend>>> value
    let reader = self.backend.read().unwrap();
    let value = reader.get_tile().await;
    if value != Err(PmtError::UnderlyingDataModified) || retries >= 2 {
      return value;
    }
    retries += 1;
    drop(reader);
    let writer = self.backend.write().unwrap();

    // pseudo-code -- need to replace RwLock's content with new reader
    mem::replace(writer, create_new_backend_instance());

    // TODO:  make sure two simultaneous calls do not re-create the same backend twice -- keep a counter or something
  }
}

The problem here is that the caller may still need to know if the backend was modified in order to reset some other internal cache

lseelenbinder commented 3 months ago

Thanks for opening the discussion @nyurik!

I think we definitely need to pass any notification back to the caller. For example, in Stadia Maps' own usage, we'd have a (potentially) a few layers of caches to invalidate if we want to support data updates without reloading. If then, the caller wants to retry, it's easy enough with tooling.

Regarding the HTTP/S3 mechanisms, I'm not sure etag alone is sufficient (since it's not universal) and there's other headers (e.g., last-modified) which might be useful. For S3 specifically, x-amz-version-id could be useful to indicate the version changed (or even restrict to a specific object version).

I would like to see a few concrete use cases before we design an API around it so we actually know what user needs are.

nyurik commented 3 months ago

@lseelenbinder I am ok to simply return an error for now. I believe etag is fairly universal, but I agree it might not cover all cases. How about this API, without any other code:

use pmtiles::reqwest::header::HeaderName;

// we already re-export reqwest, so it is easy for users to pass
// `Some(pmtiles::reqwest::header::ETAG)`
// If `None`, no header monitoring
HttpBackend::try_from(..., monitor_header: Option<HeaderName>) {...}

// For more advanced cases, we MAY eventually introduce
// some callback fn, but YAGNI until needed.

So it is totally up to the user to handle or not handle this, and passing None is identical to current API. File monitoring is also left out because it can also be done by the user without API changes (we may add it later).

P.S. Note that there is no way to reset AsyncPmTilesReader to accept the new value - which I think is OK as it will force the user to re-instantiate it.