huggingface/datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Optional Content Warning for Datasets #4163

TristanThrush opened this issue 2 years ago (Open)

TristanThrush commented 2 years ago

Is your feature request related to a problem? Please describe.

We now have hate speech datasets on the hub, like this one: https://huggingface.co/datasets/HannahRoseKirk/HatemojiBuild

I'm wondering: is there an option to set a content warning message that appears before the dataset preview? Otherwise, people immediately see hate speech when they click on this dataset.

Describe the solution you'd like

Implementation of a content warning message that separates users from the dataset preview until they click through the warning.

Describe alternatives you've considered

Possibly just a way to remove the dataset preview completely? I think I like the content warning option better, though.

mariosasko commented 2 years ago

Hi! You can use the extra_gated_prompt YAML field in a dataset card for displaying custom messages/warnings that the user must accept before gaining access to the actual dataset. This option also keeps the viewer hidden until the user agrees to terms.

HannahKirk commented 2 years ago

Hi @mariosasko, thanks for explaining how to add this feature.

If the current dataset yaml is:

---
annotations_creators:
- expert
language_creators:
- expert-generated
languages:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: HatemojiBuild
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
---

Can you provide a minimal working example of how to add the gated prompt?

Thanks!

leondz commented 2 years ago

---
annotations_creators:
- expert
language_creators:
- expert-generated
languages:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: HatemojiBuild
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- hate-speech-detection
extra_gated_prompt: "This repository contains harmful content."
---

+ enable User Access requests under the Settings pane.

There's a brief guide at https://discuss.huggingface.co/t/how-to-customize-the-user-access-requests-message/13953 and you can see the field in action at https://huggingface.co/datasets/mozilla-foundation/common_voice_7_0/blob/main/README.md (you need to agree to the terms in the Dataset Card pane to be able to access the Files pane, so this link comes up 403 at first).

And a working example here! https://huggingface.co/datasets/DDSC/dkhate :) Great to be able to mitigate harms in text.
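
One practical note: once User Access requests are enabled, loading the dataset from code needs a Hub token (after you've accepted the terms on the dataset page). Something along these lines should work with a recent version of datasets, assuming you've already run huggingface-cli login:

from datasets import load_dataset

# The dataset is gated, so pass use_auth_token=True to send the Hub token saved by
# huggingface-cli login; without it (or before the terms are accepted) loading fails.
dataset = load_dataset("HannahRoseKirk/HatemojiBuild", use_auth_token=True)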

leondz commented 2 years ago

-- is there a way to gate content anonymously, i.e. without registering which users access it?

Breakend commented 2 years ago

+1 to @leondz's question. One scenario is if you don't want the dataset to be indexed by search engines or viewed in the browser because of upstream conditions on the data, but also don't want to collect emails. Some ability to turn off the dataset viewer, or to add a gating mechanism that doesn't collect emails, would be fantastic.