Closed: severo closed this issue 10 months ago
When a dataset is being refreshed, or even after having been refreshed, its cache can be incoherent.

Case 1

Reported by @AndreaFrancis: during a dataset refresh, the search was enabled in the viewer, but the "first rows" were not shown. The split-duckdb-index step had finished, while the split-first-rows-from... steps were still being computed. The data shown by the viewer were not coherent, and it would be hard for a user to understand what was occurring.

Case 2

Seen with https://huggingface.co/datasets/jbrendsel/ECTSum. All the steps had been computed, and the config-size step had an error (PreviousStepFormatError). The erroneous cache entry remained after fixing the issue and refreshing the dataset. The reason is that a previous step (dataset-config-names) now gives an error, and the rest of the cache entries are never cleaned, so they stay in the database even if useless. See https://github.com/huggingface/datasets-server/issues/1582#issuecomment-1727151243 and https://github.com/huggingface/datasets-server/issues/1285.

Proposal

Give each "DAG" execution an identifier, and use the same identifier for all the endpoints. To ensure coherence between the responses, change the identifier we use in the API only when all the steps have been refreshed. If we send the identifier in the response, the client can also use the identifier value to check the coherence. If there is no previous identifier in the database, use the identifier of the current incomplete DAG (the dataset viewer already handles an incomplete state). Once the new DAG execution has finished, we can delete the cache entries with other identifiers.

Pros:

Cons:

Maybe a simpler approach initially: once a job_runner finishes, delete all its children and backfill.

Pros:

WDYT @huggingface/datasets-server
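To make the first option more concrete, here is a minimal sketch, assuming the cache entries live in a MongoDB collection and gain a run_id field. The collection name, the field names and the helpers are hypothetical, not the current datasets-server code:

```python
# Minimal sketch of "one identifier per DAG run" (all names are hypothetical).
import uuid
from pymongo import MongoClient

client = MongoClient()
cache = client["datasets_server_cache"]["cachedResponses"]

def start_new_run(dataset: str) -> str:
    """Give each new DAG execution of `dataset` its own identifier."""
    return uuid.uuid4().hex

def all_steps_finished(dataset: str, run_id: str, expected_steps: set[str]) -> bool:
    """True once every expected step has a cache entry for this run."""
    done = cache.distinct("kind", {"dataset": dataset, "run_id": run_id})
    return expected_steps.issubset(done)

def serve_run_id(dataset: str, new_run_id: str, expected_steps: set[str]) -> str | None:
    """Return the run_id the API should serve: the new run if complete, else the previous one."""
    if all_steps_finished(dataset, new_run_id, expected_steps):
        # The new DAG execution has finished: delete the cache entries with other identifiers.
        cache.delete_many({"dataset": dataset, "run_id": {"$ne": new_run_id}})
        return new_run_id
    previous = cache.find_one({"dataset": dataset, "run_id": {"$ne": new_run_id}})
    # If there is no previous identifier, serve the current (incomplete) run.
    return previous["run_id"] if previous else new_run_id
```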
I like the first option more, because it provides total separation between two runs. It means that we don't have to worry about obsolete cache entries anymore, and possibly it will be easier to clean the associated assets too, if they are referenced by a "run_id".
Thanks for the good points you raised (increase of the database size, taking care of the new field in the indexes, maybe one additional step).
For case 1 we could just delete the children cache entries when we click refresh in the admin app, no?
I think we could take GitHub Actions as an example.
runs/6338193861 represents a DAG run.
It has two attempts (I manually re-ran the "failed jobs", i.e. 1 out of 2).
And each job has its own id.
I'll see if we can adapt our model without making it too complex. The good thing, if we succeed, is that it will be easier to choose which versions we want to keep, or when we want to delete them. Currently, we "mutate" a common state, and it's a mess.
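A rough sketch of what such a runs/attempts/jobs model could look like on our side; all class and field names are hypothetical:

```python
# Rough sketch of a GitHub-Actions-like model: a DAG run has an id, one or more
# attempts, and each job has its own id (all names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: str          # one step, e.g. split-duckdb-index or config-size
    step: str
    status: str          # "success" | "error" | "pending"

@dataclass
class Attempt:
    attempt_number: int  # re-running the failed jobs creates attempt 2, 3, ...
    jobs: list[Job] = field(default_factory=list)

@dataclass
class DagRun:
    run_id: str          # like runs/6338193861 in GitHub Actions
    dataset: str
    attempts: list[Attempt] = field(default_factory=list)

    def latest_attempt(self) -> Attempt:
        return max(self.attempts, key=lambda a: a.attempt_number)
```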
> For case 1 we could just delete the children cache entries when we click refresh in the admin app, no?
As a short-term solution, it's a good idea: add a parameter to /force-refresh to delete all the dataset's cache entries before refreshing the first step (or it could be another admin step: /recreate-dataset?dataset=...&priority=...).
It will not solve the general issue, but it will allow us to quickly fix a specific dataset, at the expense of taking the dataset viewer down on the Hub during the update.
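For illustration, a hypothetical call to such an admin step; the endpoint is only proposed above, so the URL, the HTTP method, the token handling and the priority value are all assumptions:

```python
# Hypothetical call to the proposed /recreate-dataset admin step (it does not exist yet).
import os
import requests

ADMIN_URL = os.environ["ADMIN_URL"]  # placeholder for the admin service base URL

response = requests.post(
    f"{ADMIN_URL}/recreate-dataset",
    # the dataset is the one from case 2; the priority value is an assumption
    params={"dataset": "jbrendsel/ECTSum", "priority": "high"},
    headers={"Authorization": f"Bearer {os.environ['ADMIN_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
```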
I propose the following strategy for now:
in the webhook: instead of just updating a dataset, first delete its cache completely, then recreate it.
It would remove the need for https://github.com/huggingface/moon-landing/issues/7080, btw (internal): the revision would normally always be the last one.
What do you think @AndreaFrancis @lhoestq?
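A sketch of what that webhook flow could look like; the collection names, the delete_assets helper and the job payload are placeholders, not the actual datasets-server code:

```python
# Sketch of the proposed webhook behaviour: wipe the dataset's cache (and assets),
# then recreate everything from the first step. All names are placeholders.
from pymongo import MongoClient

client = MongoClient()
cache = client["datasets_server_cache"]["cachedResponses"]
queue = client["datasets_server_queue"]["jobs"]

def delete_assets(dataset: str) -> None:
    """Placeholder: remove the dataset's assets and cached assets from storage."""
    ...

def on_dataset_update(dataset: str, revision: str) -> None:
    # 1. Delete the dataset's cache completely (instead of mutating it in place).
    cache.delete_many({"dataset": dataset})
    delete_assets(dataset)
    # 2. Recreate: enqueue the first step of the DAG for the latest revision.
    queue.insert_one({"type": "dataset-config-names", "dataset": dataset, "revision": revision})
```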
It will also remove the assets, right?
Yes: assets and cached assets. It does nothing more (parquet, duckdb, and parquet metadata are ignored for now)
Sounds good to me!