ashim95 opened this issue 1 year ago
Update: Here is the yaml file we used:
```yaml
max_seq_len: 4096
seed: 28
model_name_or_path: ~/huggingface_cache/Llama-2-7b-hf

# Tokenizer
tokenizer:
  name: ${model_name_or_path}
  kwargs:
    model_max_length: ${max_seq_len}

models:
-
  model_name: ${model_name_or_path}
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: ${model_name_or_path}
    init_device: mixed
    pretrained: true
    token: <HF Token>
  tokenizer:
    name: ${model_name_or_path}
    kwargs:
      model_max_length: ${max_seq_len}
  load_path: # Add your (optional) Composer checkpoint path here!

device_eval_batch_size: 4
precision: fp32

# FSDP config for model sharding
# either use multiple GPUs, or comment FSDP out
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: FULL

icl_tasks:
-
  label: mnli_mismatched
  dataset_uri: scripts/eval/local_data/mnli_mismatched.jsonl # ADD YOUR OWN DATASET URI
  num_fewshot: [8]
  icl_task_type: multiple_choice
  metric_names:
  - InContextLearningMultipleChoiceAccuracy
  prompt_string: '' # this goes at the beginning of each input
  example_delimiter: "\n" # this goes between fewshot examples
  continuation_delimiter: '' # this separates questions from answers
```
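For context, this is how I understand the delimiters above being applied when each multiple-choice prompt is built and scored. The sketch below is just my mental model, not the actual `InContextLearningMultipleChoiceTaskDataset` code, and `build_prompt` is a made-up helper:

```python
# Sketch of how I understand prompt_string / example_delimiter /
# continuation_delimiter combining with the `context` and `choices` fields.
# NOT the actual InContextLearningMultipleChoiceTaskDataset code; build_prompt
# is a hypothetical helper used only for illustration.

def build_prompt(fewshot_examples, eval_example, choice,
                 prompt_string="", example_delimiter="\n",
                 continuation_delimiter=""):
    """Return one string to score: few-shot demos, then the eval context plus one choice."""
    parts = []
    for ex in fewshot_examples:
        # Each few-shot demo shows its context followed by its gold answer.
        parts.append(ex["context"] + continuation_delimiter + ex["choices"][ex["gold"]])
    # The evaluated example is followed by the candidate choice being scored.
    parts.append(eval_example["context"] + continuation_delimiter + choice)
    return prompt_string + example_delimiter.join(parts)


example = {
    "context": "Premise:\n...\n\nHypothesis:\n...\n\nLabel:\n",
    "choices": ["entailment", "neutral", "contradiction"],
    "gold": 2,
}

# With continuation_delimiter = '' the choice is appended directly after "Label:\n".
print(build_prompt([example], example, "contradiction"))
```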
And here is a sample example from the JSONL file:

```json
{
  "premise": "Your contribution helped make it possible for us to provide our students with a quality education.",
  "hypothesis": "Your contributions were of no help with our students' education.",
  "label": 2,
  "idx": 0,
  "query": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:",
  "choices": [
    "entailment",
    "neutral",
    "contradiction"
  ],
  "gold": 2,
  "context": "Premise:\nYour contribution helped make it possible for us to provide our students with a quality education.\n\nHypothesis:\nYour contributions were of no help with our students' education.\n\nLabel:\n"
}
```
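For reference, records in this format can be produced from the Hugging Face `glue` dataset with something along these lines (a sketch, not the exact script we used):

```python
# Sketch of generating mnli_mismatched.jsonl records in the format shown above
# from the Hugging Face `glue` dataset. Shown only to make the data format
# reproducible; not the exact script we used.
import json
from datasets import load_dataset

CHOICES = ["entailment", "neutral", "contradiction"]

ds = load_dataset("glue", "mnli", split="validation_mismatched")

with open("scripts/eval/local_data/mnli_mismatched.jsonl", "w") as f:
    for ex in ds:
        query = (
            f"Premise:\n{ex['premise']}\n\n"
            f"Hypothesis:\n{ex['hypothesis']}\n\n"
            "Label:"
        )
        record = {
            "premise": ex["premise"],
            "hypothesis": ex["hypothesis"],
            "label": ex["label"],
            "idx": ex["idx"],
            "query": query,
            "choices": CHOICES,
            "gold": ex["label"],      # index into `choices`
            "context": query + "\n",  # context ends with a newline, as in the sample
        }
        f.write(json.dumps(record) + "\n")
```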
Please let me know if you need any more details.
Thanks, -- ashim
We also tried running the evaluation with lm-evaluation-harness (see the sketch under Additional context below). Here are the numbers from the two libraries:
❓ Question

I am trying to benchmark `llama-2-7b` on the GLUE benchmark for in-context learning, but the accuracy I get for MNLI (mismatched validation) is 35.22% for both zero-shot and 8-shot. My questions are:

1. Is my use of `InContextLearningMultipleChoiceTaskDataset` correct for this task?
2. Is there another recommended way to implement this?

PS: I also ran the evaluation for the `qqp` task: 36.82% for 0-shot and 63.09% for 8-shot.

Any help would be greatly appreciated.

Thank you,

Additional context
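For the lm-evaluation-harness comparison mentioned in the update above, an equivalent run through the harness's Python API looks roughly like the sketch below. Task names and model arguments differ between harness versions (e.g. `mnli_mismatched` vs. `mnli_mismatch`, `hf-causal` vs. `hf`), so treat it as illustrative rather than the exact invocation we used:

```python
# Illustrative lm-evaluation-harness run for the same tasks; exact task names
# and model_args depend on the installed harness version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # or "hf" in newer harness versions
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["mnli_mismatched", "qqp"],
    num_fewshot=8,
    batch_size=4,
)
print(results["results"])
```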