huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Avoid providing "mixed" cache while refreshing a dataset #1823

Closed · severo closed this issue 10 months ago

severo commented 1 year ago

When a dataset is being refreshed, or even after it has been refreshed, its cache can be incoherent.

Case 1

Reported by @AndreaFrancis: during a dataset refresh, the search was enabled in the viewer, but the "first rows" were not shown. The split-duckdb-index step had finished, while the split-first-rows-from... steps were still being computed. The data shown by the viewer were not coherent, and it would be hard for a user to understand what was occurring.

Case 2

Seen with https://huggingface.co/datasets/jbrendsel/ECTSum. All the steps had been computed, and the config-size step had an error (PreviousStepFormatError). The erroneous cache entry remained after fixing the issue and refreshing the dataset. The reason is that a previous step (dataset-config-names) now gives an error, and the rest of the cache entries are never cleaned up, so they stay in the database even though they are useless. See https://github.com/huggingface/datasets-server/issues/1582#issuecomment-1727151243 and https://github.com/huggingface/datasets-server/issues/1285.

Proposal

Give each "DAG" execution an identifier, and use the same identifier for all the endpoints. To ensure coherence between the responses, change the identifier used by the API only once all the steps have been refreshed. If we send the identifier in the response, the client can also use its value to check coherence. If there is no previous identifier in the database, use the identifier of the current, incomplete DAG (the dataset viewer already handles an incomplete state).

Once the new DAG execution has finished, we can delete the cache entries with other identifiers.
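To make this concrete, here is a minimal sketch of how run identifiers could work, assuming a cache where every entry carries a `run_id`; all the names (`CacheEntry`, `ALL_STEPS`, `run_id_to_serve`, ...) are illustrative, not the actual dataset-viewer schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheEntry:
    dataset: str
    step: str     # e.g. "split-duckdb-index"
    run_id: str   # identifier of the DAG execution that produced the entry
    content: dict

# Illustrative subset of the DAG's steps.
ALL_STEPS = {"dataset-config-names", "config-size", "split-duckdb-index"}

def run_is_complete(entries: list[CacheEntry], run_id: str) -> bool:
    """A run is complete once every step has a cache entry with its run_id."""
    steps_done = {e.step for e in entries if e.run_id == run_id}
    return ALL_STEPS <= steps_done

def run_id_to_serve(entries: list[CacheEntry], current_run_id: str,
                    previous_run_id: Optional[str]) -> str:
    """Switch the identifier used by the API only once all the steps have been
    refreshed; with no previous run, serve the current incomplete one (the
    viewer already handles an incomplete state)."""
    if previous_run_id is not None and not run_is_complete(entries, current_run_id):
        return previous_run_id
    return current_run_id

def purge_other_runs(entries: list[CacheEntry], kept_run_id: str) -> list[CacheEntry]:
    """Once the new DAG execution has finished, drop entries of other runs."""
    return [e for e in entries if e.run_id == kept_run_id]
```

Returning the `run_id` in the API responses would also let a client verify that all the responses it combines belong to the same execution.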

AndreaFrancis commented 1 year ago

> Once the new DAG execution has finished, we can delete the cache entries with other identifiers.

Pros:

Cons:

- the database size increases
- the new identifier field has to be handled in the indexes
- maybe one additional step

Maybe initially a simpler approach: once a job_runner finishes, delete all its children and backfill (see the sketch below).

Pros:

WDYT @huggingface/datasets-server
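A rough sketch of this simpler approach, assuming each step knows its direct children in the processing graph (the step names and helpers are illustrative):

```python
# Direct children of each step in the processing graph (illustrative subset).
CHILDREN = {
    "dataset-config-names": ["config-size", "split-duckdb-index"],
    "config-size": [],
    "split-duckdb-index": [],
}

# Cache keyed by (dataset, step); values are the step results.
Cache = dict[tuple[str, str], dict]

def enqueue_backfill_job(dataset: str, step: str) -> None:
    """Hypothetical job-queue helper, stubbed for the sketch."""
    ...

def on_job_finished(cache: Cache, dataset: str, step: str, result: dict) -> None:
    """When a job_runner finishes, store its result, delete the cache entries
    of its children, and backfill: each child is recomputed from the fresh
    parent output, and the deletion cascades as each child finishes in turn."""
    cache[(dataset, step)] = result
    for child in CHILDREN.get(step, []):
        cache.pop((dataset, child), None)
        enqueue_backfill_job(dataset, child)
```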

severo commented 1 year ago

I like the first option more, because it provides total separation between two runs. It means that we no longer have to worry about obsolete cache entries, and it will possibly also be easier to clean the associated assets, if they are referenced by a "run_id".

Thanks for the good points you raised (increase of the database size, taking care of the new field in the indexes, maybe one additional step).

lhoestq commented 1 year ago

For case 1, we could just delete the children's cache entries when we click refresh in the admin app, no?

severo commented 1 year ago

I think we could take inspiration from GitHub Actions.

runs/6338193861 represents a DAG run.

It has two attempts (I manually re-ran the "failed jobs", i.e. 1 out of 2).

And each job has its own id.

I'll see if we can adapt our model without adding too much complexity. The good thing, if we succeed, is that it will be easier to choose which versions we want to keep, or when we want to delete them. Currently, we "mutate" a common state, and it's a mess.
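If we borrowed that shape, a cache key could look roughly like this (a hypothetical model, not the current schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobKey:
    dataset: str
    run_id: str   # one per DAG execution, like runs/6338193861
    attempt: int  # incremented when the failed jobs of a run are re-run
    step: str     # the individual job inside the run

# Keying cache entries by run (and attempt) keeps several versions side by
# side, so choosing which versions to keep or delete becomes an explicit
# policy instead of mutating one shared state.
```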

severo commented 1 year ago

> For case 1, we could just delete the children's cache entries when we click refresh in the admin app, no?

As a short-term solution, it's a good idea: add a parameter to /force-refresh to delete all the dataset's cache entries before refreshing the first step (or it could be a separate admin endpoint: /recreate-dataset?dataset=...&priority=...).

It will not solve the general issue, but it will allow us to quickly fix a specific dataset, at the expense of taking the dataset viewer down on the Hub during the update.
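A sketch of what such an admin endpoint could look like, assuming a Starlette-style app and two hypothetical helpers (`delete_dataset_cache`, `update_dataset`):

```python
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

def delete_dataset_cache(dataset: str) -> None:
    """Hypothetical helper: remove every cache entry of the dataset."""
    ...

def update_dataset(dataset: str, priority: str) -> None:
    """Hypothetical helper: enqueue the first step of the DAG for the dataset."""
    ...

async def recreate_dataset(request: Request) -> JSONResponse:
    dataset = request.query_params.get("dataset")
    priority = request.query_params.get("priority", "low")
    if not dataset:
        return JSONResponse({"error": "missing 'dataset' parameter"}, status_code=422)
    # Delete first, then refresh: the viewer shows the "in progress" state on
    # the Hub until the new DAG execution has finished.
    delete_dataset_cache(dataset)
    update_dataset(dataset, priority)
    return JSONResponse({"status": "ok", "dataset": dataset})

app = Starlette(routes=[Route("/recreate-dataset", recreate_dataset)])
```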

severo commented 10 months ago

I propose the following strategy for now:

- in the webhook: instead of just updating a dataset, first delete its cache completely then recreate

What do you think @AndreaFrancis @lhoestq?
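A minimal sketch of the webhook change under this strategy, with hypothetical helpers:

```python
def delete_dataset_state(dataset: str) -> None:
    """Hypothetical helper: drop all cache entries, assets and cached assets."""
    ...

def update_dataset(dataset: str) -> None:
    """Hypothetical helper: enqueue a fresh DAG execution for the dataset."""
    ...

def on_dataset_webhook(dataset: str) -> None:
    # Delete everything first so that no "mixed" cache, combining entries
    # from different runs, can ever be served; then recreate from scratch.
    delete_dataset_state(dataset)
    update_dataset(dataset)
```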

severo commented 10 months ago

Btw, it would remove the need for https://github.com/huggingface/moon-landing/issues/7080 (internal): the revision would normally always be the latest one.

AndreaFrancis commented 10 months ago

> in the webhook: instead of just updating a dataset, first delete its cache completely then recreate

It will also remove the assets, right?

severo commented 10 months ago

Yes: assets and cached assets. It does nothing more (parquet, duckdb, and parquet metadata are ignored for now).

lhoestq commented 10 months ago

Sounds good to me!