Closed by Rohan138 7 months ago
See https://github.com/openfeedback/superhf/blob/main/src/superhf/data.py for an example of dataset loading for fine-tuning data. Handle each dataset separately to get the train and test splits.
We don't need to implement loading for the Relative Pretrain Perf datasets; Eleuther's lm-eval harness will take care of that.
The dataset loader is now implemented for the following identifiers/datasets. Let me know if you need other datasets supported.
dataset_identifier (str): The identifier for the dataset. Supported identifiers are:
Note that we won't need this for running relative pretraining performance evals using the lm_eval harness, since it handles loading these datasets. But we might still use this for loading data fed into an unadaptability method.
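For reference, here is a rough sketch of what a loader keyed on `dataset_identifier` could look like, with each dataset handled separately and returning its own train and test splits. This is not the code from this repo: the names `register`, `find_local_copy`, and `load_dataset_splits`, the `toy_prompts` identifier, and the `<root>/<identifier>` directory layout are all made up for illustration. The optional local lookup assumes a shared dataset folder such as `/datasets`.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

Split = List[str]
Loader = Callable[[Optional[Path]], Tuple[Split, Split]]

# Hypothetical registry mapping a dataset identifier to its loader function.
_LOADERS: Dict[str, Loader] = {}


def register(identifier: str) -> Callable[[Loader], Loader]:
    """Decorator that records a per-dataset loader under its identifier."""
    def decorator(fn: Loader) -> Loader:
        _LOADERS[identifier] = fn
        return fn
    return decorator


def find_local_copy(identifier: str, root: Path = Path("/datasets")) -> Optional[Path]:
    """Return <root>/<identifier> if that directory exists, else None.

    The directory layout is an assumption, not something the cluster guarantees.
    """
    candidate = root / identifier
    return candidate if candidate.is_dir() else None


@register("toy_prompts")  # placeholder identifier, not a real supported dataset
def _load_toy_prompts(local_dir: Optional[Path]) -> Tuple[Split, Split]:
    # A real loader would read from local_dir when it is set and download otherwise.
    examples = [f"prompt {i}" for i in range(10)]
    return examples[:8], examples[8:]  # (train, test)


def load_dataset_splits(
    dataset_identifier: str, root: Path = Path("/datasets")
) -> Tuple[Split, Split]:
    """Dispatch to the loader registered for dataset_identifier."""
    if dataset_identifier not in _LOADERS:
        supported = ", ".join(sorted(_LOADERS))
        raise ValueError(
            f"Unsupported dataset_identifier {dataset_identifier!r}; supported: {supported}"
        )
    local_dir = find_local_copy(dataset_identifier, root)
    return _LOADERS[dataset_identifier](local_dir)
```

Usage would be something like `train, test = load_dataset_splits("toy_prompts")`. Keeping one function per dataset makes it easy to add or drop datasets without touching the dispatch logic.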
We can tear out the current torchvision loaders; MNIST stuff has been moved to the `mnist` branch.
Note that for the CAIS cluster, we should have our loaders check if the dataset is already in the `/datasets` folder.
We should prioritize the UFM and finetuning datasets and just use `lm_eval` for RPP: