Compute metrics about datasets similarity

huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.

https://huggingface.co/docs/dataset-viewer

Apache License 2.0

Compute metrics about datasets similarity #396

Open severo opened 2 years ago

severo commented 2 years ago

It would be useful to find, for a given dataset, the datasets nearest to it in terms of content.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo commented 1 year ago

It would be a new task that could be used on the Hub.

severo commented 1 year ago

Reopening in light of https://huggingface.co/spaces/librarian-bots/huggingface-datasets-semantic-search and https://twitter.com/vanstriendaniel/status/1689336183959203840.

Instead of searching by similarity in the metadata, though, the idea would be to check similarity in the data itself.
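
To make the idea concrete, here is a minimal sketch of content-based similarity. It is only an illustration, not a proposed implementation: it assumes a few sample rows are available for each dataset, and it uses a bag-of-words cosine similarity, which is one possible metric among many. The function names and the toy `samples` data are hypothetical.

```python
import math
import re
from collections import Counter


def token_counts(rows):
    """Count lowercase word tokens across all values in a list of sample rows."""
    counts = Counter()
    for row in rows:
        for value in row.values():
            counts.update(re.findall(r"\w+", str(value).lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_datasets(name, samples, k=3):
    """Rank the other datasets by content similarity to `name`, most similar first."""
    vectors = {n: token_counts(rows) for n, rows in samples.items()}
    target = vectors[name]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]


# Hypothetical toy samples standing in for the first rows of real datasets.
samples = {
    "reviews_a": [{"text": "great movie, loved the acting"}, {"text": "terrible movie plot"}],
    "reviews_b": [{"text": "the movie was great"}, {"text": "bad acting, boring plot"}],
    "weather": [{"city": "Paris", "temp": 21}, {"city": "Oslo", "temp": 9}],
}

if __name__ == "__main__":
    for other, score in nearest_datasets("reviews_a", samples):
        print(f"{other}: {score:.3f}")
```

A real version would probably need something more robust than raw token counts (for example embeddings or column statistics), and would have to deal with non-text columns and large datasets.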

severo commented 6 months ago

See https://huggingface.co/spaces/asoria/datasets-similarity-tool by @AndreaFrancis