hsb1995 opened this issue 3 months ago
Hi, to properly evaluate the quantized model on downstream tasks, you should use the lm-evaluation-harness framework. It is designed to be compatible with Hugging Face Transformers and provides a standardized way to run various benchmarks.
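For example, with a 0.3.x-style checkout of the harness (one whose `simple_evaluate` still accepts a `no_cache` argument), a downstream run looks roughly like the sketch below; the checkpoint name and task list are only placeholders:

```python
# Rough sketch of a downstream evaluation through lm-evaluation-harness's Python API.
# "hf-causal" wraps a Hugging Face causal LM; the pretrained path and tasks are placeholders.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=facebook/opt-1.3b",  # placeholder checkpoint
    tasks=["arc_easy", "piqa"],                 # placeholder downstream tasks
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
    no_cache=True,
)
print(evaluator.make_table(results))  # summary table of task metrics
```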
Thank you, Professor. I know the framework, but to use it I first need the quantized model and only then can I run the inference task. I checked the code in the project you provided carefully, and I couldn't find the code that produces the final quantized model or the quantized weights.
Hi, the model is actually quantized on the fly when you run inference. If you look at our code closely, you will see that quantization happens during inference.
Dear Professor, I followed your advice about the lm-evaluation-harness framework and tried to rewrite the code. The original project runs the evaluation through lm_eval internally, and I want to rewrite it so that it runs as a standalone .py file, because I would really like to reproduce your results and learn from them. Could you please help me with my inference code? It still doesn't work, and I would be grateful if you could share your inference/evaluation code!
```python
import argparse
import fnmatch
import json
import logging
import os
import sys

import torch

sys.path.append("./lm-evaluation-harness")
import lm_eval.models
from lm_eval import evaluator, tasks, utils

try:
    import wandb
    wandb_installed = True
except ModuleNotFoundError:
    wandb_installed = False

logging.getLogger("openai").setLevel(logging.WARNING)


class MultiChoice:
    def __init__(self, choices):
        self.choices = choices

    # Simple wildcard support (linux filename patterns)
    def __contains__(self, values):
        for value in values.split(","):
            if len(fnmatch.filter(self.choices, value)) == 0:
                return False
        return True

    def __iter__(self):
        for choice in self.choices:
            yield choice


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--model_args", default="")
    parser.add_argument("--tasks", default=None, choices=MultiChoice(tasks.ALL_TASKS))
    parser.add_argument("--provide_description", action="store_true")
    parser.add_argument("--num_fewshot", type=int, default=0)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--exp_name", type=str, default=None, help="Experiment name if not load.")
    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--output_path", default=None)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--decontamination_ngrams_path", default=None)
    parser.add_argument("--description_dict_path", default=None)
    parser.add_argument("--check_integrity", action="store_true")
    parser.add_argument("--log_wandb", action="store_true")
    parser.add_argument(
        "--model_seqlen",
        type=int,
        default=4096,
        choices=[2048, 4096],
        help="Model seqlen and calibration data context length.",
    )
    parser.add_argument("--load_weights", type=str, default=None, help="Path to load model weights.")
    return parser.parse_args()


def pattern_match(patterns, source_list):
    # Expand comma-separated task patterns (with wildcards) into concrete task names.
    task_names = set()
    for pattern in patterns:
        for matching in fnmatch.filter(source_list, pattern):
            task_names.add(matching)
    return list(task_names)


def main():
    args = parse_args()
    assert not args.provide_description  # not implemented

    if args.log_wandb:
        assert args.exp_name or args.load_weights
        wandb.init(
            config={a: getattr(args, a) for a in dir(args) if not a.startswith("_")},
        )

    if args.limit:
        print("WARNING: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")

    if args.tasks is None:
        task_names = tasks.ALL_TASKS
    else:
        task_names = pattern_match(args.tasks.split(","), tasks.ALL_TASKS)
    print(f"Selected Tasks: {task_names}")

    description_dict = {}
    if args.description_dict_path:
        with open(args.description_dict_path, "r") as f:
            description_dict = json.load(f)

    if args.model_args is None:
        args.model_args = ""

    # Build the lm-eval model wrapper from the --model name and the --model_args string.
    lm = lm_eval.models.get_model(args.model).create_from_arg_string(
        args.model_args, dict(batch_size=args.batch_size, device=args.device)
    )
    print("lm.device", lm.device)
    if hasattr(lm.model, "hf_device_map"):
        print("Model device map:\n", lm.model.hf_device_map)

    if args.load_weights:
        # Optionally overwrite the wrapped model's weights with a saved state dict.
        print("Loading model weights ...")
        lm.model.load_state_dict(torch.load(args.load_weights))

    results = evaluator.simple_evaluate(
        model=lm,
        model_args=args.model_args,
        tasks=task_names,
        num_fewshot=args.num_fewshot,
        batch_size=args.batch_size,
        device=args.device,
        no_cache=True,
        limit=args.limit,
        description_dict=description_dict,
        decontamination_ngrams_path=args.decontamination_ngrams_path,
        check_integrity=args.check_integrity,
        log_wandb=args.log_wandb,
    )

    if not isinstance(results["config"]["model"], str):
        results["config"]["model"] = results["config"]["model"].model.config._name_or_path

    dumped = json.dumps(results, indent=2)
    print(dumped)
    if args.output_path:
        with open(args.output_path, "w") as f:
            f.write(dumped)

    print(
        f"{args.model} ({args.model_args}), limit: {args.limit}, provide_description: {args.provide_description}, "
        f"num_fewshot: {args.num_fewshot}, batch_size: {args.batch_size}"
    )
    print(evaluator.make_table(results))


if __name__ == "__main__":
    main()
```
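For reference, I invoke the script roughly like this (the script name, model path, weight path, and task list here are just placeholders): `python lmeval.py --model hf-causal --model_args pretrained=/path/to/model --tasks arc_easy,piqa --batch_size 1 --device cuda:0 --load_weights /path/to/quantized_state_dict.pt`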
How do you reproduce the paper's results with your code? Following your example, I get the following result when I run "generate", but I still can't see the effect of the quantization described in the paper on the downstream tasks.