r-three / phatgoose

Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization"
https://arxiv.org/abs/2402.05859

How can I use datasets hosted on Hugging Face? #4

alexrs opened this issue 2 months ago

alexrs commented 2 months ago

What

I am interested in training a PHATGOOSE model with several experts using datasets hosted on Hugging Face. How can I do that?

muqeeth commented 2 months ago

Hi! Our code uses gin configs to build the classes for training, evaluation, dataset creation, and model building. Each class is built under a scope: "D/" is the dataset scope, "M/" is the model scope, "P/TRAIN" is the training scope, and "P/EVALUATE" is the evaluation scope. Variables defined within a scope are used when instantiating the corresponding class.

An example config for the dataset social_iqa looks like this:

D/P3SOCIALIQA/P3Dataset:
  # Path to load the dataset. P3Dataset calls load_data from the Dataset class,
  # which needs the dataset_path variable defined. Since dataset_path is the same
  # for train and eval, the classes built under the scopes "D/P3SOCIALIQA/TRAIN"
  # and "D/P3SOCIALIQA/EVAL" both pick up this variable, so we don't need to
  # define dataset_path separately for train and eval.
  dataset_path = ["huggingface", "social_i_qa"]

# This line is needed for every class we build, so that the class is constructed under the defined scope.
D/P3SOCIALIQA/TRAIN/build.cls = @P3Dataset

# All remaining arguments for the train dataset class
D/P3SOCIALIQA/TRAIN/P3Dataset:
  batch_size = 32
  split = "train"
  max_examples_per_dataset = 500_000

To create the same dataset for evaluation, we adjust the scope name and necessary arguments:
D/P3SOCIALIQA/EVAL/build.cls = @P3Dataset
# "mc" is for multiple-choice tasks; we use "gen" for generation tasks.
D/P3SOCIALIQA/EVAL/InterfaceInfo.interface = "mc"
# Additional arguments; most of these are specific to PromptSource datasets
D/P3SOCIALIQA/EVAL/P3Dataset:
  split = "validation"
  metrics = ["accuracy"]
  round_robin_template = True
  ignore_templates = ["Check if a random answer is valid or not"]
  include_templates = "original"

Additionally, we use capital letters for scope names to distinguish them from class and method names.
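
To target a different Hugging Face dataset, the main change is the dataset_path entry: the first element stays "huggingface" and the second becomes the dataset id from the Hub. A minimal sketch, where the MYDATASET scope name and the my_org/my_dataset id are placeholders rather than names from the repo:

D/MYDATASET/P3Dataset:
  # The second element is the dataset id on the Hugging Face Hub
  dataset_path = ["huggingface", "my_org/my_dataset"]

The TRAIN and EVAL blocks then mirror the social_iqa ones above, with P3SOCIALIQA replaced by MYDATASET.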

If your dataset uses the PromptSource format, you can use the P3Dataset class and add a config for it similar to the social_iqa config above. Otherwise, you can use the FlatDataset class; we did this for additional BIG-bench and FLAN datasets that have plain input and output strings. You can add your dataset's config to either p3_t5xl.gin or any gin file that gets passed to the gin_files argument in https://github.com/r-three/phatgoose/blob/4fd579099f013cdb98c7fb75b67f716a124e122f/colm/experiments/bash_scripts/train_single_task_loralinear.sh#L50
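
For illustration, a FlatDataset entry could follow the same shape as the P3Dataset configs above; the scope name and dataset id below are placeholders, and FlatDataset may expect slightly different arguments, so check its constructor before copying this:

D/MYFLATDATASET/FlatDataset:
  dataset_path = ["huggingface", "my_org/my_flat_dataset"]

D/MYFLATDATASET/TRAIN/build.cls = @FlatDataset
D/MYFLATDATASET/TRAIN/FlatDataset:
  batch_size = 32
  split = "train"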

If your dataset needs different formatting, you can subclass FlatDataset and override the relevant methods. Please make sure each tokenized example matches what __getitem__ in FlatDataset returns, so that the downstream code can properly batch input_ids and target_ids. Currently, our code supports T5 encoder-decoder models for routing among a mixture of experts; I will let you know once we add decoder-only model support.
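
As a rough sketch of that subclassing pattern: everything below except the requirement that __getitem__ return tokenized input_ids and target_ids is an assumption rather than the repo's actual interface (self.dataset_path, self.split, self.tokenizer, and load_data are hypothetical names), so treat it as a starting point only:

from datasets import load_dataset

# Hypothetical child class of phatgoose's FlatDataset with custom formatting.
class MyCustomDataset(FlatDataset):
    def load_data(self):
        # e.g. dataset_path = ["huggingface", "my_org/my_dataset"] from the gin config
        _, dataset_name = self.dataset_path
        self.examples = load_dataset(dataset_name, split=self.split)

    def __getitem__(self, idx):
        example = self.examples[idx]
        # Custom formatting: build whatever input/target strings your task needs.
        input_text = "question: " + example["question"] + "\nanswer:"
        target_text = example["answer"]
        # Downstream batching expects tokenized input_ids and target_ids.
        return {
            "input_ids": self.tokenizer(input_text, truncation=True)["input_ids"],
            "target_ids": self.tokenizer(target_text, truncation=True)["input_ids"],
        }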