huggingface / dataset-viewer

Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
https://huggingface.co/docs/datasets-server
Apache License 2.0

Viewer shows outdated cache after renaming a repo and creating a new one with the old name #2964

Open albertvillanova opened 1 week ago

albertvillanova commented 1 week ago

Reported by @lewtun (internal link: https://huggingface.slack.com/archives/C02EMARJ65P/p1719818961944059):

If I rename a dataset via the UI from D -> D' and then create a new dataset with the same name D, I seem to get a copy instead of an empty dataset.

Indeed, it was the dataset viewer showing a cached result: the git history is clean and there are no files in the new dataset repo.

julien-c commented 1 week ago

Is it possible to also list the solution we discussed in that thread (moving from repo name to _id)?

That would make the discussion a bit more efficient.

lhoestq commented 1 week ago

in particular:

_id field in hf.co/api/datasets - guaranteed immutable for a given repo
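For readers unfamiliar with the field, here is a minimal sketch of reading `_id` for a dataset. It assumes the per-dataset endpoint lives at `https://huggingface.co/api/datasets/<repo name>` and returns a JSON object containing `_id`; the helper names `extract_internal_id` and `fetch_internal_id` are hypothetical and are not part of dataset-viewer.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the Hub datasets API (see hf.co/api/datasets above).
HUB_DATASETS_API = "https://huggingface.co/api/datasets"


def extract_internal_id(payload: dict) -> str:
    """Return the immutable `_id` from a dataset-info payload."""
    internal_id = payload.get("_id")
    if not isinstance(internal_id, str) or not internal_id:
        raise ValueError("response has no usable `_id` field")
    return internal_id


def fetch_internal_id(repo_name: str, timeout: float = 10.0) -> str:
    """Query the Hub for a dataset by name and return its `_id` (network call)."""
    with urlopen(f"{HUB_DATASETS_API}/{repo_name}", timeout=timeout) as response:
        return extract_internal_id(json.load(response))
```

After a rename D -> D' followed by the creation of a fresh D, the two names would be expected to resolve to different `_id` values, which is what makes the field usable for detecting the situation reported above.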

Some first thoughts:

Then regarding the API:

I also considered using _id everywhere as the source of truth, but I anticipate it would just move the problem elsewhere: to wherever we cache the _id <-> repo name mapping (the repo name is always needed to read/write to repos, and also for the dataset-viewer API).
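To make the trade-off concrete, here is a minimal sketch, not the dataset-viewer implementation. It assumes a hypothetical in-memory cache keyed by repo name that also records the `_id` seen when the entry was written, and a `resolve_id` callable (also hypothetical) that returns the current `_id` for a name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CacheEntry:
    internal_id: str  # the Hub's immutable `_id` when the entry was cached
    content: str      # hypothetical cached viewer response


class ViewerCache:
    """Hypothetical cache keyed by repo name and validated against `_id`."""

    def __init__(self, resolve_id: Callable[[str], Optional[str]]) -> None:
        # resolve_id maps a repo name to its current `_id`, or None if the
        # name does not exist. It is injected so the sketch runs offline.
        self._resolve_id = resolve_id
        self._entries: Dict[str, CacheEntry] = {}

    def put(self, name: str, content: str) -> None:
        current_id = self._resolve_id(name)
        if current_id is None:
            raise KeyError(f"unknown repo: {name}")
        self._entries[name] = CacheEntry(current_id, content)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        if self._resolve_id(name) != entry.internal_id:
            # The name now points to a different repo: drop the stale entry.
            del self._entries[name]
            return None
        return entry.content


# Simulate: cache D, rename D -> D', then create a brand-new D.
hub = {"user/D": "id-1"}
cache = ViewerCache(hub.get)
cache.put("user/D", "rows of the original dataset")
hub["user/D-prime"] = hub.pop("user/D")  # rename keeps the same `_id`
hub["user/D"] = "id-2"                   # new repo reuses the old name
stale_lookup = cache.get("user/D")       # None: the stale entry is rejected
```

Note that every `get` still has to resolve the current `_id` for the name. If that lookup is itself cached, the staleness problem reappears in the name <-> `_id` mapping, which is the concern raised above.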

albertvillanova commented 6 days ago

Thanks for the complementary information, @lhoestq.

So, basically, we would need a complete refactoring of all the logic that identifies repositories, and you also think that this would just move the problem elsewhere... :thinking:

I am wondering whether we could instead tackle the real underlying problem directly, that is, properly handle the repository renaming event, even if a new repository with the old name is created.
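As an illustration of that idea, here is a minimal sketch assuming the cache is a plain mapping from repo name to cached content and that rename and creation events reach the service in order. The handler names `on_repo_renamed` and `on_repo_created` are hypothetical and do not correspond to existing dataset-viewer functions.

```python
from typing import Dict


def on_repo_renamed(cache: Dict[str, str], old_name: str, new_name: str) -> None:
    """Move the cached entry to the new name so nothing is left under the old one."""
    entry = cache.pop(old_name, None)
    if entry is not None:
        cache[new_name] = entry


def on_repo_created(cache: Dict[str, str], name: str) -> None:
    """Defensive cleanup: a brand-new repo must never inherit a cached entry."""
    cache.pop(name, None)


# Simulate the reported scenario: rename D -> D', then create a new D.
cache = {"user/D": "rows of the original dataset"}
on_repo_renamed(cache, "user/D", "user/D-prime")
on_repo_created(cache, "user/D")
```

After both handlers run, the cached rows are only reachable under the new name, and the recreated `user/D` starts with no cached entry. The sketch does not cover missed or out-of-order events, which a real fix would need to handle.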