ssl-hep / ServiceX

ServiceX - a data delivery service pilot for IRIS-HEP DOMA
BSD 3-Clause "New" or "Revised" License

Refresh Cached Datasets #850

Closed BenGalewsky closed 1 day ago

BenGalewsky commented 1 month ago

As a ServiceX analyzer, I want to be able to refresh the list of replicas in a cached dataset so that I can work around invalid replicas.

Description

Caching datasets has improved the performance of the system, since we no longer look up DIDs after they have been resolved once. There are cases, however, where the list of replicas in the cache is no longer valid, and we need a way to refresh it.

Approach

From a data integrity point of view, it makes sense to keep the old replica records around, since they were used in generating existing transforms. We have foreign key constraints on the files and transform_result tables, so simply deleting a dataset would cause foreign key violations unless we also deleted the old transforms and their results. These constraints help keep the database consistent.
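
For context, here is a minimal sketch of the relationships described above, assuming a Flask-SQLAlchemy setup; the column names and the direction of the transform_result link are assumptions for illustration, not the actual ServiceX schema:

    from flask_sqlalchemy import SQLAlchemy

    db = SQLAlchemy()

    class File(db.Model):
        __tablename__ = 'files'
        id = db.Column(db.Integer, primary_key=True)
        # Each cached replica row points back at its dataset, so deleting
        # a datasets row outright violates this constraint.
        dataset_id = db.Column(db.Integer, db.ForeignKey('datasets.id'))

    class TransformResult(db.Model):
        __tablename__ = 'transform_result'
        id = db.Column(db.Integer, primary_key=True)
        # Transform results reference the file they were produced from.
        file_id = db.Column(db.Integer, db.ForeignKey('files.id'))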

Modify the Datasets Table

We can add a new column to the datasets table to record whether a DID is stale and needs to be looked up again the next time it is requested.

  1. Add a new boolean column to the datasets table:

    is_stale = db.Column(db.Boolean, default=False)
  2. Update the find_by_name method to only return rows where is_stale is False (see the sketch below)
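
Continuing the sketch above, the new column and the filtered lookup might look like this; the Dataset model and find_by_name follow the issue text, everything else is illustrative:

    class Dataset(db.Model):
        __tablename__ = 'datasets'
        id = db.Column(db.Integer, primary_key=True)
        name = db.Column(db.String, nullable=False)
        # New flag: True means the cached replica list must be re-resolved.
        is_stale = db.Column(db.Boolean, default=False)

        @classmethod
        def find_by_name(cls, name):
            # Skip stale entries so a marked dataset falls through to a
            # fresh DID lookup instead of reusing invalid replicas.
            return cls.query.filter_by(name=name, is_stale=False).first()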

Add New Endpoints

We will add new REST endpoints to allow users to manipulate the stale flag. For usability, we should also have an endpoint that lists datasets.

  1. GET on /dataset to return a listing of all datasets along with some statistics
  2. DELETE on /dataset/{id} to mark the dataset as stale and force a re-query the next time it is requested (see the sketch below)
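
A rough sketch of the two endpoints, building on the model above; the JSON shape and the statistics returned are assumptions:

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route('/dataset', methods=['GET'])
    def list_datasets():
        # List every cached dataset with a few basic statistics.
        return jsonify([
            {'id': d.id, 'name': d.name, 'is_stale': d.is_stale}
            for d in Dataset.query.all()
        ])

    @app.route('/dataset/<int:dataset_id>', methods=['DELETE'])
    def invalidate_dataset(dataset_id):
        # Soft delete: mark the dataset stale rather than removing rows,
        # so existing files and transform_result records stay intact.
        dataset = Dataset.query.get_or_404(dataset_id)
        dataset.is_stale = True
        db.session.commit()
        return '', 204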

New CLI Commands

Add new commands to the CLI to let users interact with these new endpoints:

    servicex dataset list
    servicex dataset delete
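
One way to wire these commands up, using click and requests; the server URL, output format, and the dataset-id argument on delete (inferred from the DELETE /dataset/{id} endpoint) are placeholders, not the real ServiceX client code:

    import click
    import requests

    SERVICEX_URL = 'http://localhost:5000'  # placeholder server address

    @click.group()
    def dataset():
        """Inspect and invalidate cached datasets."""

    @dataset.command('list')
    def list_cmd():
        # Print id, name, and staleness for every cached dataset.
        for ds in requests.get(f'{SERVICEX_URL}/dataset').json():
            status = 'stale' if ds['is_stale'] else 'ok'
            click.echo(f"{ds['id']}\t{ds['name']}\t{status}")

    @dataset.command('delete')
    @click.argument('dataset_id')
    def delete_cmd(dataset_id):
        # Ask the server to mark the dataset stale (a soft delete).
        requests.delete(f'{SERVICEX_URL}/dataset/{dataset_id}').raise_for_status()
        click.echo(f'Dataset {dataset_id} marked stale')

    if __name__ == '__main__':
        dataset()
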
ponyisi commented 1 month ago

Hi @BenGalewsky - thanks for this explanation. I guess I'm perhaps a bit uncomfortable with the possibility that an "invalidate cache" request could cause the loading of 60,000 new rows into the files table even if absolutely nothing has changed, though... perhaps I'm prematurely over-optimizing?

ponyisi commented 1 month ago

Also... do we have a potential issue if a comma were to appear in a filename, since that seems to be used as a separator...?