mozilla-ai / lm-buddy

Your buddy in the (L)LM space.

Update lm-eval version and API #85

Closed veekaybee closed 8 months ago

veekaybee commented 8 months ago

What's changing

lm-eval has now been bumped to 0.4.2, which sets trust_remote_code=True as the default for a select set of curated datasets, removing the trust warning for those datasets specifically. The datasets dependency is also bumped as a result.
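To reproduce the environment change locally, something like this should work (the matching datasets floor comes from lm-eval's own requirements, so a plain upgrade alongside it suffices):

pip install -U "lm-eval==0.4.2" datasets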

This change also removes the need to call lm_eval.tasks.initialize_tasks(); task registration is now handled by the TaskManager API:

import lm_eval
from lm_eval.tasks import TaskManager

# Optional -- you only need to instantiate a TaskManager yourself
# if you want to pass a custom task path.
task_manager = TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

# `lm` is a pre-instantiated lm_eval model object.
lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)

How to test it

Run a test Ray job using the evaluate entrypoint to evaluate using some default params:

# Model to evaluate
model:
  load_from: "tiiuae/falcon-7b"
  torch_dtype: "bfloat16"

# Settings specific to lm_harness.evaluate
# (note: this section still uses the old API)
evaluator:
  tasks: ["arithmetic"]
  num_fewshot: 5
  limit: 10

quantization:
  load_in_4bit: True
  bnb_4bit_quant_type: "fp4"
  bnb_4bit_compute_dtype: "bfloat16"

# Tracking info for where to log the run results
tracking:
  name: "tiiuae-falcon-7b-trust"
  project: "vicki-entity"
  entity: "entity"
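To kick off the job, something along these lines should work (a sketch: the config filename and the exact entrypoint invocation are assumptions -- adjust to however you normally submit lm-buddy jobs):

# Sketch: assumes the config above is saved as eval_config.yaml and that
# the lm-buddy evaluate entrypoint is available in the Ray runtime env
# (exact invocation may differ).
ray job submit --working-dir . -- lm_buddy evaluate lm-harness --config eval_config.yaml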

Related Jira Ticket

Additional notes for reviewers

aittalam commented 8 months ago

What does lm_eval.tasks.ALL_TASKS return now? Is it the full list of tasks without the need to do any initialization? If so, this is great, as it will solve the issue we had when multiple inits occurred!

veekaybee commented 8 months ago

> What does lm_eval.tasks.ALL_TASKS return now?

This was changed to task_manager.all_tasks: https://github.com/EleutherAI/lm-evaluation-harness/pull/1321/files#diff-45b7cf15ed225746696faea6973a591c9526c8a78f486e28b11f63635e543666R13

I think this initializes all tasks, but we can do some testing to be sure.
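For example (a quick sketch; the task list contents will vary by version):

from lm_eval.tasks import TaskManager

task_manager = TaskManager()        # registers built-in tasks at init time
print(len(task_manager.all_tasks))  # number of registered task names
print(task_manager.all_tasks[:10])  # peek at the first few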

aittalam commented 8 months ago

> > What does lm_eval.tasks.ALL_TASKS return now?
>
> This was changed to task_manager.all_tasks: https://github.com/EleutherAI/lm-evaluation-harness/pull/1321/files#diff-45b7cf15ed225746696faea6973a591c9526c8a78f486e28b11f63635e543666R13
>
> I think this initializes all tasks, but we can do some testing to be sure.

I looked into the code; it seems that a new TaskManager (which performs task initialization at init time) is created if none is specified, so that should fix the problem that required us to manually run the initialization.

@sfriedowitz can you test the new code in your new environment once it's merged, so we can see if that solves the concurrency issue? I think it should, but if not we can move the lazy TaskManager creation up into our code and just pass it to simple_evaluate as a parameter.
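If we do end up pulling it into our code, a minimal sketch of what I mean (get_task_manager is a hypothetical helper, not in the codebase):

import functools

from lm_eval.tasks import TaskManager

@functools.lru_cache(maxsize=1)
def get_task_manager() -> TaskManager:
    # Build the TaskManager once per process so concurrent evaluations share
    # a single task registry instead of re-initializing it on every call.
    return TaskManager()

# ...then pass it explicitly:
# lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=get_task_manager())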