Rewrite `data.py` to load from HuggingFace datasets

mukobi / Unadaptable-Foundation-Models

MIT License

3 stars, 0 forks

Rewrite `data.py` to load from HuggingFace datasets #1

Closed. Rohan138 closed this issue 7 months ago

Rohan138 commented 7 months ago

We can tear out the current torchvision loaders; MNIST stuff has been moved to the mnist branch.

Note that for the CAIS cluster, we should have our loaders check if the dataset is already in the /datasets folder.
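
A minimal sketch of the local-first check described above. It assumes the cluster mirrors each dataset under `/datasets/<org>/<name>` in a layout that `load_dataset` can read directly. That layout and the helper names `resolve_dataset_source` and `load_with_local_fallback` are hypothetical, not taken from the repository.

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_source(repo_id: str, local_root: str = "/datasets") -> str:
    """Return a local directory if the dataset is already on disk, else the Hub repo id."""
    # Assumed layout: <local_root>/<org>/<name>, mirroring the Hub repo id.
    local_path = Path(local_root) / repo_id
    if local_path.is_dir():
        return str(local_path)
    return repo_id


def load_with_local_fallback(repo_id: str, subset: Optional[str] = None,
                             local_root: str = "/datasets"):
    """Load from the local copy when present, otherwise download from the Hub."""
    # Imported lazily so the path check above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(resolve_dataset_source(repo_id, local_root), subset)
```

If the local copies were written with `Dataset.save_to_disk`, `datasets.load_from_disk` would be the matching loader instead of `load_dataset`.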

We should prioritize the UFM and finetuning datasets and just use lm_eval for RPP (relative pretraining performance).

mukobi commented 7 months ago

See https://github.com/openfeedback/superhf/blob/main/src/superhf/data.py for an example of dataset loading for fine-tuning data. Handle each dataset separately to get train and test splits.
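
One way the per-dataset train/test handling could look, as a rough sketch rather than the superhf approach: reuse a test split when the dataset ships one, and otherwise split train with `Dataset.train_test_split`. The helper name `get_train_test` and the default `test_size` and `seed` values are assumptions.

```python
from typing import Any, Mapping, Tuple


def get_train_test(splits: Mapping[str, Any], test_size: float = 0.1,
                   seed: int = 0) -> Tuple[Any, Any]:
    """Return (train, test) from a DatasetDict-like mapping of split names."""
    if "test" in splits:
        # The dataset already ships a held-out split, so use it as is.
        return splits["train"], splits["test"]
    # Otherwise carve a test set out of train. Dataset.train_test_split
    # returns a DatasetDict with "train" and "test" keys.
    parts = splits["train"].train_test_split(test_size=test_size, seed=seed)
    return parts["train"], parts["test"]
```

Fixing the seed keeps the held-out examples the same across runs, which matters when comparing fine-tuning results.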

We don't need to implement loading for the Relative Pretrain Perf datasets; EleutherAI's lm-eval harness will take care of that.

owen-yeung commented 7 months ago

Implemented a dataset loader for the following identifiers/datasets. Let me know if you need other datasets supported.

dataset_identifier (str): The identifier for the dataset. Supported identifiers are:

"cyber" for the 'cais/wmdp-corpora' dataset with the 'cyber-forget-corpus' subset.
"harmfulqa" for the 'declare-lab/HarmfulQA' dataset.
"toxic" for the 'allenai/real-toxicity-prompts' dataset.
"pile" for the 'NeelNanda/pile-10k' dataset.

mukobi commented 7 months ago

Note that we won't need this for running relative pretraining performance evals using the lm_eval harness, since it handles loading these datasets. But we might still use this for loading data fed into an unadaptability method.
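
As a rough sketch of what running the evals through the harness might look like, assuming lm_eval v0.4 or later and its `simple_evaluate` entry point. The helper names are hypothetical, and the thread does not say which tasks make up the relative pretraining performance suite, so the task list is left to the caller.

```python
from typing import Dict, List


def format_model_args(args: Dict[str, str]) -> str:
    """Join key/value pairs into the comma-separated string lm_eval expects."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def run_pretrain_evals(model_name: str, tasks: List[str]) -> dict:
    """Evaluate a Hugging Face model on lm_eval tasks and return per-task metrics."""
    # Imported lazily; requires the `lm_eval` package (v0.4 or later assumed).
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=format_model_args({"pretrained": model_name}),
        tasks=tasks,
    )
    return results["results"]
```

A call such as `run_pretrain_evals("gpt2", ["lambada_openai"])` would download the task data on its own; the model and task here are placeholders only.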