mlfoundations / dclm

DataComp for Language Models

Missing files or bugs in evaluation code? #31

Closed: ch-shin closed this issue 2 months ago

ch-shin commented 3 months ago

Hi team, I have two questions regarding the DataComp evaluation.

  1. Which evaluation config should we use for the DataComp competition evaluation (heavy.yaml)? I got some results with medium.yaml, but I was unable to find the "Core" and "Extended" scores.
  2. Running the evaluation with heavy.yaml gave the error below (light.yaml and medium.yaml worked) --- can we get some help? Thank you 🙏

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py:549 in <module>      │
│                                                                              │
│   546                                                                        │
│   547                                                                        │
│   548 if __name__ == "__main__":                                             │
│ ❱ 549 │   main()                                                             │
│   550                                                                        │
│                                                                              │
│ /home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py:512 in main          │
│                                                                              │
│   509 │   │   │   data_name = result["val_data"][0].split("/")[-2]           │
│   510 │   │   │   eval_metrics["downstream_perpexity"][data_name] = result[" │
│   511 │                                                                      │
│ ❱ 512 │   icl_results = evaluate(eval_model, tokenizer, eval_cfg)            │
│   513 │   eval_metrics["icl"] = icl_results                                  │
│   514 │                                                                      │
│   515 │   date_format = "%Y_%m_%d-%H_%M_%S"                                  │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/torch/utils/_ │
│ contextlib.py:115 in decorate_context                                        │
│                                                                              │
│   112 │   @functools.wraps(func)                                             │
│   113 │   def decorate_context(*args, **kwargs):                             │
│   114 │   │   with ctx_factory():                                            │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                               │
│   116 │                                                                      │
│   117 │   return decorate_context                                            │
│   118                                                                        │
│                                                                              │
│ /home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py:148 in evaluate      │
│                                                                              │
│   145 │   )                                                                  │
│   146 │   icl_tasks_w_categories = list(map(lambda x: x["label"], icl_tasks_ │
│   147 │                                                                      │
│ ❱ 148 │   evaluators, logger_keys = build_icl_evaluators(                    │
│   149 │   │   cfg.icl_tasks, tokenizer, cfg.max_seq_len, cfg.device_eval_bat │
│   150 │   )                                                                  │
│   151 │   in_memory_logger = InMemoryLogger()  # track metrics in the in_mem │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/llmfoundry/ut │
│ ils/builders.py:219 in build_icl_evaluators                                  │
│                                                                              │
│   216 │   │   │   │   os.remove(destination_path)                            │
│   217 │   │   │   dist.barrier()                                             │
│   218 │   │   │                                                              │
│ ❱ 219 │   │   │   dataloaders = get_icl_task_dataloader(                     │
│   220 │   │   │   │   icl_cfg.icl_task_type,                                 │
│   221 │   │   │   │   icl_cfg.dataset_uri,                                   │
│   222 │   │   │   │   tokenizer,                                             │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/composer/data │
│ sets/in_context_learning_evaluation.py:1323 in get_icl_task_dataloader       │
│                                                                              │
│   1320 │   │   │   )                                                         │
│   1321 │   │   return result_dls                                             │
│   1322 │   else:                                                             │
│ ❱ 1323 │   │   return build_icl_dataloader(                                  │
│   1324 │   │   │   icl_task_type,                                            │
│   1325 │   │   │   dataset_uri,                                              │
│   1326 │   │   │   tokenizer,                                                │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/composer/data │
│ sets/in_context_learning_evaluation.py:1145 in build_icl_dataloader          │
│                                                                              │
│   1142 │   │   │   │   │   │   │   │   │   │   │   │    fewshot_random_seed= │
│   1143 │   │   effective_batchsize = batch_size                              │
│   1144 │   elif icl_task_type == 'question_answering':                       │
│ ❱ 1145 │   │   dataset = InContextLearningQATaskDataset(dataset_uri,         │
│   1146 │   │   │   │   │   │   │   │   │   │   │   │    tokenizer,           │
│   1147 │   │   │   │   │   │   │   │   │   │   │   │    max_seq_len,         │
│   1148 │   │   │   │   │   │   │   │   │   │   │   │    pad_tok_id,          │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/composer/data │
│ sets/in_context_learning_evaluation.py:153 in __init__                       │
│                                                                              │
│    150 │   │   │   │   get_file(dataset_uri, destination_path, overwrite=Tru │
│    151 │   │   dataset = load_dataset('json', data_files=destination_path, s │
│    152 │   │   self.samples = list(                                          │
│ ❱  153 │   │   │   dataset.map(lambda examples: {                            │
│    154 │   │   │   │   'context': examples['context'],                       │
│    155 │   │   │   │   'answer': examples['answer'],                         │
│    156 │   │   │   │   'aliases': examples['aliases']                        │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/arro │
│ w_dataset.py:563 in wrapper                                                  │
│                                                                              │
│    560 │   │   else:                                                         │
│    561 │   │   │   self: "Dataset" = kwargs.pop("self")                      │
│    562 │   │   # apply actual function                                       │
│ ❱  563 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kw │
│    564 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance( │
│    565 │   │   for dataset in datasets:                                      │
│    566 │   │   │   # Remove task templates if a column mapping of the templa │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/arro │
│ w_dataset.py:528 in wrapper                                                  │
│                                                                              │
│    525 │   │   │   "output_all_columns": self._output_all_columns,           │
│    526 │   │   }                                                             │
│    527 │   │   # apply actual function                                       │
│ ❱  528 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kw │
│    529 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance( │
│    530 │   │   # re-apply format to the output                               │
│    531 │   │   for dataset in datasets:                                      │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/arro │
│ w_dataset.py:2953 in map                                                     │
│                                                                              │
│   2950 │   │   │   │   │   leave=False,                                      │
│   2951 │   │   │   │   │   desc=desc or "Map",                               │
│   2952 │   │   │   │   ) as pbar:                                            │
│ ❱ 2953 │   │   │   │   │   for rank, done, content in Dataset._map_single(** │
│   2954 │   │   │   │   │   │   if done:                                      │
│   2955 │   │   │   │   │   │   │   shards_done += 1                          │
│   2956 │   │   │   │   │   │   │   logger.debug(f"Finished processing shard  │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/arro │
│ w_dataset.py:3307 in _map_single                                             │
│                                                                              │
│   3304 │   │   │   │   if not batched:                                       │
│   3305 │   │   │   │   │   _time = time.time()                               │
│   3306 │   │   │   │   │   for i, example in shard_iterable:                 │
│ ❱ 3307 │   │   │   │   │   │   example = apply_function_on_filtered_inputs(e │
│   3308 │   │   │   │   │   │   if update_data:                               │
│   3309 │   │   │   │   │   │   │   if i == 0:                                │
│   3310 │   │   │   │   │   │   │   │   buf_writer, writer, tmp_file = init_b │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/arro │
│ w_dataset.py:3210 in apply_function_on_filtered_inputs                       │
│                                                                              │
│   3207 │   │   │   │   additional_args += (effective_indices,)               │
│   3208 │   │   │   if with_rank:                                             │
│   3209 │   │   │   │   additional_args += (rank,)                            │
│ ❱ 3210 │   │   │   processed_inputs = function(*fn_args, *additional_args, * │
│   3211 │   │   │   if isinstance(processed_inputs, LazyDict):                │
│   3212 │   │   │   │   processed_inputs = {                                  │
│   3213 │   │   │   │   │   k: v for k, v in processed_inputs.data.items() if │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/composer/data │
│ sets/in_context_learning_evaluation.py:156 in <lambda>                       │
│                                                                              │
│    153 │   │   │   dataset.map(lambda examples: {                            │
│    154 │   │   │   │   'context': examples['context'],                       │
│    155 │   │   │   │   'answer': examples['answer'],                         │
│ ❱  156 │   │   │   │   'aliases': examples['aliases']                        │
│    157 │   │   │   }))                                                       │
│    158 │   │   self.samples = strip_data(self.samples)                       │
│    159 │   │   self.tokenizer = tokenizer                                    │
│                                                                              │
│ /home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/datasets/form │
│ atting/formatting.py:280 in __getitem__                                      │
│                                                                              │
│   277 │   │   return len(self.data)                                          │
│   278 │                                                                      │
│   279 │   def __getitem__(self, key):                                        │
│ ❱ 280 │   │   value = self.data[key]                                         │
│   281 │   │   if key in self.keys_to_format:                                 │
│   282 │   │   │   value = self.format(key)                                   │
│   283 │   │   │   self.data[key] = value                                     │
╰──────────────────────────────────────────────────────────────────────────────╯
KeyError: 'aliases'
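
From the traceback, the failure is inside composer's InContextLearningQATaskDataset, which maps every example to 'context', 'answer', and 'aliases', so it looks like at least one of the task jsonl files referenced by heavy.yaml has no 'aliases' field. A minimal sketch for spotting such a file (the glob is only a guess at where the ICL task data ends up locally):

import glob
import json

# Hypothetical location of the ICL task jsonl files; adjust the glob to your setup.
for path in glob.glob("eval/local_data/**/*.jsonl", recursive=True):
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    missing = sum("aliases" not in row for row in rows)
    if missing:
        print(f"{path}: {missing}/{len(rows)} rows without 'aliases'")
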
Muennighoff commented 3 months ago

I think heavy.yaml is missing HumanEval, which is part of Extended, no? cc @Vaishaal

afang-story commented 3 months ago

@ch-shin

  1. heavy.yaml is the correct one to run.
  2. Can you provide more information on which task it failed on? You could also try pinning to a specific llm-foundry version (0.7.0 or 0.8.0 should work; a quick version check is sketched below).
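
For reference, a quick way to confirm which llm-foundry version the eval environment actually picks up (plain stdlib, nothing dclm-specific):

from importlib.metadata import version

# Prints the installed llm-foundry distribution version; pin it afterwards
# (e.g. pip install "llm-foundry==0.7.0") if it turns out to be older.
print("llm-foundry:", version("llm-foundry"))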
Muennighoff commented 3 months ago

heavy.yaml is the correct one to run.

But isn't it missing HumanEval, @afang-story?

afang-story commented 3 months ago

@Muennighoff HumanEval can be found in heavy_code.yaml, which includes heavy.yaml as well as additional code evaluations.

ch-shin commented 2 months ago

The error was resolved by updating llm-foundry (0.2.0 --> 0.7.0), but I then hit another error.

Map:   0%|          | 0/373 [00:00<?, ? examples/s]
Map:  17%|█▋        | 65/373 [00:00<00:00, 636.73 examples/s]
Map:  35%|███▌      | 131/373 [00:00<00:00, 642.71 examples/s]
Map:  53%|█████▎    | 197/373 [00:00<00:00, 642.93 examples/s]
Map:  79%|███████▊  | 293/373 [00:00<00:00, 640.25 examples/s]
Map:  96%|█████████▌| 358/373 [00:00<00:00, 640.76 examples/s]
Map: 100%|██████████| 373/373 [00:00<00:00, 608.93 examples/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py", line 551, in <module>
[rank0]:     main()
[rank0]:   File "/home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py", line 514, in main
[rank0]:     icl_results = evaluate(eval_model, tokenizer, eval_cfg)
[rank0]:   File "/home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/ubuntu/research_nfs/dclm/eval/eval_openlm_ckpt.py", line 148, in evaluate
[rank0]:     evaluators, logger_keys = build_icl_evaluators(
[rank0]:   File "/home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/llmfoundry/utils/builders.py", line 576, in build_icl_evaluators
[rank0]:     _validate_cfg(icl_cfg)
[rank0]:   File "/home/ubuntu/miniconda3/envs/dclm/lib/python3.10/site-packages/llmfoundry/utils/builders.py", line 544, in _validate_cfg
[rank0]:     raise ValueError(
[rank0]: ValueError: No metric_names defined, unable to build default metrics for icl_task_type=question_answering.

That one was fixed by changing icl_task_type=question_answering --> icl_task_type=generation_task_with_answers in heavy.yaml.
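
In case it helps anyone else making the same update, the rename can be applied with a one-off snippet like the one below (the config path and the exact "key: value" formatting inside the yaml are assumptions; adjust to your checkout):

from pathlib import Path

# Hypothetical path to the eval config; adjust to wherever heavy.yaml lives in your checkout.
cfg_path = Path("eval/heavy.yaml")
text = cfg_path.read_text()
cfg_path.write_text(
    text.replace(
        "icl_task_type: question_answering",
        "icl_task_type: generation_task_with_answers",
    )
)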

afang-story commented 2 months ago

It may also be possible that updating to 0.8.0 would fix this. Anyway, glad that things seem to work now. Marking as closed.