huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Avoid providing "mixed" cache while refreshing a dataset #1823

Closed · severo closed this issue 10 months ago

severo commented 1 year ago

When a dataset is being refreshed, or even after it has been refreshed, its cache can be incoherent.

Case 1

Reported by @AndreaFrancis: during a dataset refresh, the search was enabled in the viewer, but the "first rows" were not shown. The split-duckdb-index step had finished, while the split-first-rows-from... steps were still being computed. The data shown by the viewer were not coherent, and it would be hard for a user to understand what was occurring.

Case 2

Seen with https://huggingface.co/datasets/jbrendsel/ECTSum. All the steps had been computed, and the config-size step had an error (PreviousStepFormatError). The erroneous cache entry remained after fixing the issue and refreshing the dataset. The reason is that a previous step (dataset-config-names) now gives an error, and the rest of the cache entries are never cleaned up, so they stay in the database even though they are useless. See https://github.com/huggingface/datasets-server/issues/1582#issuecomment-1727151243 and https://github.com/huggingface/datasets-server/issues/1285.

Proposal

Give each "DAG" execution an identifier, and use the same identifier for all the endpoints. To ensure coherence between the responses, change the identifier used by the API only once all the steps have been refreshed. If we send the identifier in the response, the client can also use its value to check coherence. If there is no previous identifier in the database, use the identifier of the current, incomplete DAG (the dataset viewer already handles an incomplete state).

Once the new DAG execution has finished, we can delete the cache entries with other identifiers.
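To make this concrete, here is a minimal sketch of how run identifiers could work, assuming a cache where every entry carries a `run_id`; all the names (`CacheEntry`, `ALL_STEPS`, `run_id_to_serve`, ...) are illustrative, not the actual dataset-viewer schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheEntry:
    dataset: str
    step: str     # e.g. "split-duckdb-index"
    run_id: str   # identifier of the DAG execution that produced the entry
    content: dict

# Illustrative subset of the DAG's steps.
ALL_STEPS = {"dataset-config-names", "config-size", "split-duckdb-index"}

def run_is_complete(entries: list[CacheEntry], run_id: str) -> bool:
    """A run is complete once every step has a cache entry with its run_id."""
    steps_done = {e.step for e in entries if e.run_id == run_id}
    return ALL_STEPS <= steps_done

def run_id_to_serve(entries: list[CacheEntry], current_run_id: str,
                    previous_run_id: Optional[str]) -> str:
    """Switch the identifier used by the API only once all the steps have been
    refreshed; with no previous run, serve the current incomplete one (the
    viewer already handles an incomplete state)."""
    if previous_run_id is not None and not run_is_complete(entries, current_run_id):
        return previous_run_id
    return current_run_id

def purge_other_runs(entries: list[CacheEntry], kept_run_id: str) -> list[CacheEntry]:
    """Once the new DAG execution has finished, drop entries of other runs."""
    return [e for e in entries if e.run_id == kept_run_id]
```

Returning the `run_id` in the API responses would also let a client verify that all the responses it combines belong to the same execution.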

AndreaFrancis commented 1 year ago

> Once the new DAG execution has finished, we can delete the cache entries with other identifiers.

Pros:

Cons:

- the database size increases
- the new identifier field has to be handled in the indexes
- maybe one additional step

Maybe initially a simpler approach: once a job_runner finishes, delete all its children and backfill (see the sketch below).

Pros:

WDYT @huggingface/datasets-server
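A rough sketch of this simpler approach, assuming each step knows its direct children in the processing graph (the step names and helpers are illustrative):

```python
# Direct children of each step in the processing graph (illustrative subset).
CHILDREN = {
    "dataset-config-names": ["config-size", "split-duckdb-index"],
    "config-size": [],
    "split-duckdb-index": [],
}

# Cache keyed by (dataset, step); values are the step results.
Cache = dict[tuple[str, str], dict]

def enqueue_backfill_job(dataset: str, step: str) -> None:
    """Hypothetical job-queue helper, stubbed for the sketch."""
    ...

def on_job_finished(cache: Cache, dataset: str, step: str, result: dict) -> None:
    """When a job_runner finishes, store its result, delete the cache entries
    of its children, and backfill: each child is recomputed from the fresh
    parent output, and the deletion cascades as each child finishes in turn."""
    cache[(dataset, step)] = result
    for child in CHILDREN.get(step, []):
        cache.pop((dataset, child), None)
        enqueue_backfill_job(dataset, child)
```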

severo commented 1 year ago

I like the first option more, because it provides total separation between two runs. It means that we no longer have to worry about obsolete cache entries, and it will possibly also be easier to clean the associated assets, if they are referenced by a "run_id".

Thanks for the good points you raised (increase of the database size, taking care of the new field in the indexes, maybe one additional step).

lhoestq commented 1 year ago

For case 1, we could just delete the children's cache entries when we click refresh in the admin app, no?

severo commented 1 year ago

I think we could take inspiration from GitHub Actions.

runs/6338193861 represents a DAG run.

It has two attempts (I manually re-ran the "failed jobs", i.e. 1 out of 2).

And each job has its own id.

I'll see if we can adapt our model without adding too much complexity. The good thing, if we succeed, is that it will be easier to choose which versions we want to keep, or when we want to delete them. Currently, we "mutate" a common state, and it's a mess.
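If we borrowed that shape, a cache key could look roughly like this (a hypothetical model, not the current schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobKey:
    dataset: str
    run_id: str   # one per DAG execution, like runs/6338193861
    attempt: int  # incremented when the failed jobs of a run are re-run
    step: str     # the individual job inside the run

# Keying cache entries by run (and attempt) keeps several versions side by
# side, so choosing which versions to keep or delete becomes an explicit
# policy instead of mutating one shared state.
```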

severo commented 1 year ago

> For case 1, we could just delete the children's cache entries when we click refresh in the admin app, no?

As a short-term solution, it's a good idea: add a parameter to /force-refresh to delete all the dataset's cache entries before refreshing the first step (or it could be a separate admin endpoint: /recreate-dataset?dataset=...&priority=...).

It will not solve the general issue, but it will allow us to quickly fix a specific dataset, at the expense of taking the dataset viewer down on the Hub during the update.
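A sketch of what such an admin endpoint could look like, assuming a Starlette-style app and two hypothetical helpers (`delete_dataset_cache`, `update_dataset`):

```python
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

def delete_dataset_cache(dataset: str) -> None:
    """Hypothetical helper: remove every cache entry of the dataset."""
    ...

def update_dataset(dataset: str, priority: str) -> None:
    """Hypothetical helper: enqueue the first step of the DAG for the dataset."""
    ...

async def recreate_dataset(request: Request) -> JSONResponse:
    dataset = request.query_params.get("dataset")
    priority = request.query_params.get("priority", "low")
    if not dataset:
        return JSONResponse({"error": "missing 'dataset' parameter"}, status_code=422)
    # Delete first, then refresh: the viewer shows the "in progress" state on
    # the Hub until the new DAG execution has finished.
    delete_dataset_cache(dataset)
    update_dataset(dataset, priority)
    return JSONResponse({"status": "ok", "dataset": dataset})

app = Starlette(routes=[Route("/recreate-dataset", recreate_dataset)])
```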

severo commented 10 months ago

I propose the following strategy for now:

- in the webhook: instead of just updating a dataset, first delete its cache completely then recreate

What do you think @AndreaFrancis @lhoestq?
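A minimal sketch of the webhook change under this strategy, with hypothetical helpers:

```python
def delete_dataset_state(dataset: str) -> None:
    """Hypothetical helper: drop all cache entries, assets and cached assets."""
    ...

def update_dataset(dataset: str) -> None:
    """Hypothetical helper: enqueue a fresh DAG execution for the dataset."""
    ...

def on_dataset_webhook(dataset: str) -> None:
    # Delete everything first so that no "mixed" cache, combining entries
    # from different runs, can ever be served; then recreate from scratch.
    delete_dataset_state(dataset)
    update_dataset(dataset)
```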

severo commented 10 months ago

Btw, it would remove the need for https://github.com/huggingface/moon-landing/issues/7080 (internal): the revision would normally always be the latest one.

AndreaFrancis commented 10 months ago

> in the webhook: instead of just updating a dataset, first delete its cache completely then recreate

It will also remove the assets, right?

severo commented 10 months ago

Yes: assets and cached assets. It does nothing more (parquet, duckdb, and parquet metadata are ignored for now).

lhoestq commented 10 months ago

Sounds good to me!