ashim95 opened this issue 1 year ago
Update: Here is the yaml file we used:
```yaml
max_seq_len: 4096
seed: 28
model_name_or_path: ~/huggingface_cache/Llama-2-7b-hf

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

models:
-
  model_name: ${model_name_or_path}
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: ${model_name_or_path}
    init_device: mixed
    pretrained: true
    token: <HF Token>
  tokenizer:
    name: ${model_name_or_path}
    kwargs:
      model_max_length: ${max_seq_len}
  load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: mnli_mismatched
  dataset_uri: scripts/eval/local_data/mnli_mismatched.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [8]
  icl_task_type: multiple_choice
  metric_names:
  - InContextLearningMultipleChoiceAccuracy
  prompt_string: '' # this goes at the beginning of each input
  example_delimiter: "\n" # this goes between fewshot examples
  continuation_delimiter: '' # this separates questions from answers
```
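For context, this is how I understand the delimiters above being applied when each multiple-choice prompt is built and scored. The sketch below is just my mental model, not the actual `InContextLearningMultipleChoiceTaskDataset` code, and `build_prompt` is a made-up helper:

```python
# Sketch of how I understand prompt_string / example_delimiter /
# continuation_delimiter combining with the `context` and `choices` fields.
# NOT the actual InContextLearningMultipleChoiceTaskDataset code; build_prompt
# is a hypothetical helper used only for illustration.

def build_prompt(fewshot_examples, eval_example, choice,
                 prompt_string="", example_delimiter="\n",
                 continuation_delimiter=""):
    """Return one string to score: few-shot demos, then the eval context plus one choice."""
    parts = []
    for ex in fewshot_examples:
        # Each few-shot demo shows its context followed by its gold answer.
        parts.append(ex["context"] + continuation_delimiter + ex["choices"][ex["gold"]])
    # The evaluated example is followed by the candidate choice being scored.
    parts.append(eval_example["context"] + continuation_delimiter + choice)
    return prompt_string + example_delimiter.join(parts)


example = {
    "context": "Premise:\n...\n\nHypothesis:\n...\n\nLabel:\n",
    "choices": ["entailment", "neutral", "contradiction"],
    "gold": 2,
}

# With continuation_delimiter = '' the choice is appended directly after "Label:\n".
print(build_prompt([example], example, "contradiction"))
```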
And here is a sample example from the JSONL file:

```json
{
  "premise": "Your contribution helped make it possible for us to provide our students with a quality education.",
  "hypothesis": "Your contributions were of no help with our students' education.",
  "label": 2,
  "idx": 0,
  "query": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:",
  "choices": [
    "entailment",
    "neutral",
    "contradiction"
  ],
  "gold": 2,
  "context": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:\n"
}
```
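For reference, records in this format can be produced from the Hugging Face `glue` dataset with something along these lines (a sketch, not the exact script we used):

```python
# Sketch of generating mnli_mismatched.jsonl records in the format shown above
# from the Hugging Face `glue` dataset. Shown only to make the data format
# reproducible; not the exact script we used.
import json
from datasets import load_dataset

CHOICES = ["entailment", "neutral", "contradiction"]

ds = load_dataset("glue", "mnli", split="validation_mismatched")

with open("scripts/eval/local_data/mnli_mismatched.jsonl", "w") as f:
    for ex in ds:
        query = (
            f"Premise:\n{ex['premise']}\n\n"
            f"Hypothesis:\n{ex['hypothesis']}\n\n"
            "Label:"
        )
        record = {
            "premise": ex["premise"],
            "hypothesis": ex["hypothesis"],
            "label": ex["label"],
            "idx": ex["idx"],
            "query": query,
            "choices": CHOICES,
            "gold": ex["label"],      # index into `choices`
            "context": query + "\n",  # context ends with a newline, as in the sample
        }
        f.write(json.dumps(record) + "\n")
```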
Please let me know if you need any more details.
Thanks, -- ashim
We also tried running the evaluation with lm-evaluation-harness (see the sketch under Additional context below). Here are the numbers from the two libraries:
❓ Question

I am trying to benchmark `llama-2-7b` on the GLUE benchmark for in-context learning, but the accuracy I get for MNLI (mismatched validation) is 35.22% for both zero-shot and 8-shot. My questions are:

1. Is my use of `InContextLearningMultipleChoiceTaskDataset` correct for this task?
2. Is there another recommended way to implement this?

PS: I also ran the evaluation for the `qqp` task: 36.82% for 0-shot and 63.09% for 8-shot.

Any help would be greatly appreciated.

Thank you,

Additional context
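For the lm-evaluation-harness comparison mentioned in the update above, an equivalent run through the harness's Python API looks roughly like the sketch below. Task names and model arguments differ between harness versions (e.g. `mnli_mismatched` vs. `mnli_mismatch`, `hf-causal` vs. `hf`), so treat it as illustrative rather than the exact invocation we used:

```python
# Illustrative lm-evaluation-harness run for the same tasks; exact task names
# and model_args depend on the installed harness version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # or "hf" in newer harness versions
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["mnli_mismatched", "qqp"],
    num_fewshot=8,
    batch_size=4,
)
print(results["results"])
```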