Closed by BenGalewsky 1 day ago
Hi @BenGalewsky - thanks for this explanation. I guess I'm perhaps a bit uncomfortable with the possibility that an "invalidate cache" request could cause the loading of 60,000 new rows into the `files` table even if absolutely nothing has changed, though... perhaps I'm prematurely over-optimizing?

Also... do we have a potential issue if a comma were to appear in a filename, since a comma seems to be used as a separator?
As a ServiceX analyzer I want to be able to refresh the list of replicas in a cached dataset so I can work around invalid replicas
Description
Caching datasets has improved the performance of the system, since we no longer look up a DID after it has been resolved once. However, there are cases where the replicas cached for a dataset are no longer valid, and we need a way to refresh them.
Approach
From a data integrity point of view it makes sense to keep the old replica records around, since they were used in generating existing transforms. We have foreign key constraints on the `files` and `transform_result` tables, so simply deleting the dataset would cause foreign key violations unless we also deleted the old transforms and their results. These constraints help keep the database consistent.
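For illustration, here is a minimal sketch of that constraint in Flask-SQLAlchemy (assuming ServiceX uses it; the `File` class and column names here are illustrative, not the actual models):

```python
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class File(db.Model):
    __tablename__ = 'files'

    id = db.Column(db.Integer, primary_key=True)
    # A datasets row cannot be deleted while files rows still reference it;
    # that is the foreign key violation described above.
    dataset_id = db.Column(db.Integer, db.ForeignKey('datasets.id'),
                           nullable=False)
```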
Modify the Datasets Table
We can add a new column to the `datasets` table to record whether a DID is stale and needs to be looked up again the next time it is requested (a sketch of the change follows the list below).

- Add a new boolean column, `is_stale`, to the `datasets` table
- Update the `find_by_name` method to only return rows where `is_stale` is False
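A minimal sketch of the change, continuing the Flask-SQLAlchemy sketch above; `is_stale` and `find_by_name` come from this issue, while the other columns and the single-row return are assumptions:

```python
class Dataset(db.Model):
    __tablename__ = 'datasets'

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(512), nullable=False)
    # New flag; defaults to False so existing cache entries stay valid
    is_stale = db.Column(db.Boolean, default=False, nullable=False)

    @classmethod
    def find_by_name(cls, name):
        # Skip stale rows so the caller falls through to a fresh DID lookup
        return cls.query.filter_by(name=name, is_stale=False).one_or_none()
```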
Add New Endpoints
We will add new REST endpoints to allow users to manipulate the stale setting. For usability we should also have an endpoint that lists datasets. A sketch of both endpoints follows the list below.
- `/dataset` to return a listing of all datasets along with some statistics
- `/dataset/{id}` to mark the dataset as stale and force a re-query next time
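A hedged sketch of what these endpoints could look like in Flask, reusing the `Dataset` model from the sketch above; the HTTP verbs (GET for the listing, DELETE to mark stale) and the response fields are assumptions, not the final API:

```python
from flask import Blueprint, jsonify

datasets_api = Blueprint('datasets_api', __name__)

@datasets_api.route('/dataset', methods=['GET'])
def list_datasets():
    # Listing of all cached datasets along with some basic statistics
    return jsonify([
        {'id': ds.id, 'name': ds.name, 'is_stale': ds.is_stale}
        for ds in Dataset.query.all()
    ])

@datasets_api.route('/dataset/<int:dataset_id>', methods=['DELETE'])
def mark_dataset_stale(dataset_id):
    # Mark the dataset stale; the next find_by_name misses and the
    # DID is re-resolved
    ds = Dataset.query.get_or_404(dataset_id)
    ds.is_stale = True
    db.session.commit()
    return jsonify({'id': ds.id, 'is_stale': True})
```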
New CLI Commands
Add new commands to the CLI to allow users to easily interact with these new endpoints.
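A hypothetical sketch of such commands built on `click` and `requests`; the command names and the `SERVICEX_ENDPOINT` environment variable are illustrative, not the real ServiceX CLI:

```python
import os

import click
import requests

ENDPOINT = os.environ.get('SERVICEX_ENDPOINT', 'http://localhost:5000')

@click.group()
def datasets():
    """Inspect and refresh cached datasets."""

@datasets.command('list')
def list_datasets():
    """Print all cached datasets with their stale flag."""
    resp = requests.get(f'{ENDPOINT}/dataset')
    resp.raise_for_status()
    for ds in resp.json():
        click.echo(f"{ds['id']}\t{ds['name']}\tstale={ds['is_stale']}")

@datasets.command('refresh')
@click.argument('dataset_id', type=int)
def refresh(dataset_id):
    """Mark a dataset stale so its replicas are re-resolved."""
    requests.delete(f'{ENDPOINT}/dataset/{dataset_id}').raise_for_status()
    click.echo(f'Dataset {dataset_id} marked stale')

if __name__ == '__main__':
    datasets()
```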