huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Solution to issue #7080: modify load_dataset so that it prompts the user to select a dataset when subdatasets or splits (train, test) are available #7191

Closed negativenagesh closed 1 week ago

negativenagesh commented 1 month ago

Suggestions are welcome.

This PR was raised in response to issue https://github.com/huggingface/datasets/issues/7080

This PR proposes a solution to https://github.com/huggingface/datasets/issues/7080 by:

  1. Checking whether the dataset has splits or subdatasets.
  2. Printing the available splits/subdatasets.
  3. Asking the user to choose which one to load.
  4. Loading only the selected dataset based on the user's input.

Key Changes:

  1. Available Splits/Subdatasets: The code checks for available splits/subdatasets using builder_instance.info.splits.keys().
  2. User Prompt: If splits are found, it prints them out and prompts the user to select one.
  3. Loading Based on User Input: The dataset is loaded based on the user's choice.

This way, load_dataset interactively prompts the user to select which subdataset or split to load, instead of automatically loading all of them.
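The prompt step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name prompt_for_split and its input_fn parameter are hypothetical, and the list of names is passed in directly rather than read from builder_instance.info.splits.keys().

```python
from typing import Callable, Iterable


def prompt_for_split(
    available: Iterable[str],
    input_fn: Callable[[str], str] = input,  # hypothetical hook, eases testing
) -> str:
    """Print the available splits/subdatasets and return the one the user picks."""
    names = list(available)
    if not names:
        raise ValueError("no splits or subdatasets to choose from")

    print("Available splits/subdatasets:")
    for name in names:
        print(f"  - {name}")

    # Keep asking until the answer matches one of the listed names.
    while True:
        choice = input_fn("Which one do you want to load? ").strip()
        if choice in names:
            return choice
        print(f"{choice!r} is not one of {names}; please try again.")
```

In the PR's flow, the names would presumably come from builder_instance.info.splits.keys(), and the returned name would then decide which split is loaded.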

lhoestq commented 1 month ago

I think the approach presented in https://github.com/huggingface/datasets/pull/6832 is the one we'll be taking.

Asking for user input is not a good idea, since load_dataset is used a lot on servers that don't have someone in front of them to select a split.
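To make the concern concrete, here is a small sketch of what a prompt does when nobody is there to answer it. It does not use the datasets library; ask_for_split and run_without_terminal are hypothetical names, and an empty io.StringIO stands in for a server process whose standard input is closed or empty.

```python
import io
import sys


def ask_for_split(available):
    # Stand-in for an interactive prompt inside a loading function.
    return input(f"Choose one of {available}: ")


def run_without_terminal():
    """Call the prompt with an empty stdin, as in a job with no attached terminal."""
    original_stdin = sys.stdin
    sys.stdin = io.StringIO("")  # nothing to read: simulates closed/empty stdin
    try:
        ask_for_split(["train", "test"])
    except EOFError:
        # input() raises EOFError when it hits end-of-file without reading data.
        return "EOFError"
    finally:
        sys.stdin = original_stdin
    return "ok"
```

If standard input is instead left open with nobody typing, the same call would simply wait, which is just as problematic for an unattended job.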