vllm-project / llm-compressor

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Apache License 2.0

lm_eval compatibility with generated model #83

Open horheynm opened 1 month ago

horheynm commented 1 month ago

Describe the bug
After a model is generated by running big_model_fp8.py, lm_eval does not work unless the .py files from the original base model are copied into the generated model folder. This happens with https://huggingface.co/microsoft/Phi-3-medium-128k-instruct

OSError: test_phi_3_medium_128k_instruct_fp8 does not appear to have a file named configuration_phi3.py. Checkout 'https://huggingface.co/test_phi_3_medium_128k_instruct_fp8/tree/None' for available files.
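For reference, a minimal sketch of the workaround mentioned above: copying the remote-code .py files from the base model into the quantized output directory. The use of huggingface_hub.snapshot_download and the exact paths here are assumptions for illustration, not part of the original report.

# Hypothetical workaround: copy the remote-code .py files (e.g.
# configuration_phi3.py, modeling_phi3.py) from the base model's local
# snapshot into the quantized output directory so that loading with
# trust_remote_code can find them.
import shutil
from pathlib import Path

from huggingface_hub import snapshot_download

# download (or reuse from cache) only the .py files of the base model
base_snapshot = Path(
    snapshot_download("microsoft/Phi-3-medium-128k-instruct", allow_patterns=["*.py"])
)
output_dir = Path("./test_phi_3_medium_128k_instruct_fp8")

for py_file in base_snapshot.glob("*.py"):
    shutil.copy(py_file, output_dir / py_file.name)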

Expected behavior
Run lm_eval without any errors.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]:
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]:
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce

import torch
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (  # noqa
    calculate_offload_device_map,
    custom_offload_device_map,
)

# define a llmcompressor recipe for FP8 quantization
# this recipe requires no calibration data since inputs are dynamically quantized
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
"""

# model_stub = "meta-llama/Meta-Llama-3-70B-Instruct"
model_stub = "microsoft/Phi-3-medium-128k-instruct"

# determine which layers to offload to cpu based on available resources
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16
)

# alternatively, specify the maximum memory to allocate per GPU directly
# device_map = custom_offload_device_map(
#    model_stub, max_memory_per_gpu="10GB", num_gpus=2, torch_dtype=torch.float16
# )

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)

# output_dir = "./test_output_llama3b_70b_fp8"
output_dir = "./test_" + model_stub.split("/")[-1].replace("-", "_").lower() + "_fp8"

oneshot(
    model=model,
    recipe=recipe,
    output_dir=output_dir,
    save_compressed=True,
    tokenizer=AutoTokenizer.from_pretrained(model_stub),
)

Then run

 CUDA_VISIBLE_DEVICES=5 bash eval_openllm.sh "test_phi_3_medium_128k_instruct_fp8" "tensor_parallel_size=1,max_model_len=4096,trust_remote_code=True,add_bos_token=True,gpu_memory_utilization=0.7"

where eval_openllm.sh is

export MODEL_DIR=${1}
export MODEL_ARGS=${2}
MODEL_NAME=$(basename ${MODEL_DIR})

declare -A tasks_fewshot=(
    ["gsm8k"]=5
)

declare -A batch_sizes=(
    ["gsm8k"]=16
)

for TASK in "${!tasks_fewshot[@]}"; do
    NUM_FEWSHOT=${tasks_fewshot[$TASK]}
    BATCH_SIZE=${batch_sizes[$TASK]}
    lm_eval --model vllm \
        --model_args pretrained=$MODEL_DIR,$MODEL_ARGS \
        --tasks ${TASK} \
        --num_fewshot ${NUM_FEWSHOT} \
        --write_out \
        --show_config \
        --device cuda \
        --batch_size ${BATCH_SIZE} \
        --trust_remote_code \
        --output_path="results/${MODEL_NAME}/${TASK}"
done
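As a quick check independent of lm_eval (an assumption, not from the original report), the missing-file error can likely be reproduced by loading the saved config directly with trust_remote_code:

# Hypothetical minimal reproduction: Transformers looks for
# configuration_phi3.py in the output directory because the saved
# config.json still references the Phi-3 remote-code classes.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "test_phi_3_medium_128k_instruct_fp8", trust_remote_code=True
)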

Errors
If applicable, add a full print-out of any errors or exceptions that are raised, or include screenshots to help explain your problem.

Additional context
Add any other context about the problem here. Also include any relevant files.

robertgshaw2-neuralmagic commented 3 weeks ago

@horheynm can this be closed?