salesforce / summary-of-a-haystack

Codebase accompanying the Summary of a Haystack paper.
https://arxiv.org/abs/2407.01370
Apache License 2.0

Make dataset available on HF #1

Open NielsRogge opened 3 months ago

NielsRogge commented 3 months ago

Hi folks,

Congrats on this very interesting work. Would you be up for making your dataset available on the Hugging Face hub, so that people can do:

from datasets import load_dataset

dataset = load_dataset("salesforce/summary-of-a-haystack")

? This would make it more easily available to the community. Here's how to load a HF dataset from JSON files: https://huggingface.co/docs/datasets/loading#json.

Moreover, it could be linked to the paper page: https://huggingface.co/papers/2407.01370. Here's how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.

Let me know if you need any help!

Kind regards,

Niels
ML Engineer @ HF
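
For readers following along, the JSON loading route mentioned in the comment above might look roughly like the sketch below. It assumes the `datasets` library is installed and uses its generic `"json"` builder on a local JSON Lines file. The helper names (`write_jsonl`, `load_local_json`) are hypothetical and do not come from the repository.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines) and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_local_json(path):
    """Load a local JSON Lines file with the generic "json" builder.

    Assumption: the `datasets` package is installed (pip install datasets).
    The import is kept inside the function so the helper above works without it.
    """
    from datasets import load_dataset

    return load_dataset("json", data_files=str(path), split="train")
```

With toy records such as `[{"topic": "example", "documents": ["doc one", "doc two"]}]` (field names made up for illustration), calling `load_local_json(write_jsonl(records, "toy.jsonl"))` should give back a `Dataset` with one row per record.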

Alex-Fabbri commented 3 months ago

Great suggestion! I just added it here! Let us know if you encounter any issues.

from datasets import load_dataset

dataset = load_dataset("Salesforce/summary-of-a-haystack")['train']

NielsRogge commented 3 months ago

Thanks so much! Pinging @severo to make the dataset viewer work

severo commented 3 months ago

The issue is that the dataset contains 10 rows, and each row is a very large JSON object. The dataset viewer doesn't currently support this use case.

Maybe @lhoestq has some ideas. The dataset is here: https://huggingface.co/datasets/Salesforce/summary-of-a-haystack
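
To get a feel for the "few rows, each very large" shape described above, one option is to measure how many bytes each row takes when serialized as JSON. The sketch below uses only the standard library and toy rows. The helper names and the toy field names are hypothetical, and it makes no claim about the dataset viewer's actual limits.

```python
import json


def row_sizes_bytes(rows):
    """Return the UTF-8 byte size of each row when serialized as JSON."""
    return [
        len(json.dumps(row, ensure_ascii=False).encode("utf-8"))
        for row in rows
    ]


def summarize(rows):
    """Summarize the row count and serialized sizes for an iterable of dict rows."""
    sizes = row_sizes_bytes(rows)
    return {
        "num_rows": len(sizes),
        "max_bytes": max(sizes, default=0),
        "total_bytes": sum(sizes),
    }


if __name__ == "__main__":
    # Toy stand-in: 10 rows, each holding several long strings.
    toy_rows = [
        {"id": i, "documents": ["lorem ipsum " * 1000] * 5}
        for i in range(10)
    ]
    print(summarize(toy_rows))
```

Assuming the rows of the loaded dataset are plain JSON-serializable dicts, the same `summarize` call could be pointed at them instead of the toy rows.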