modelscope / evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
https://evalscope.readthedocs.io/en/latest/
Apache License 2.0

llmuses 0.3.2: running a built-in dataset fails with ImportError: cannot import name '_datasets_server' from 'datasets.utils' (/data/anaconda3/envs/eval-scope/lib/python3.10/site-packages/datasets/utils/__init__.py) #76

Open jackqdldd opened 3 months ago

jackqdldd commented 3 months ago

python -m llmuses.run --model qwen/Qwen2-7B-Instruct --template-type qwen --datasets trivia_qa limit 2

slin000111 commented 3 months ago

Is the error ModuleNotFoundError: No module named 'llmuses.benchmarks.limit'?

slin000111 commented 3 months ago

Use --limit 2. Without the leading --, limit is parsed as another dataset name, hence the error above.

jackqdldd commented 3 months ago

(screenshot of the error)

python -m llmuses.run --model qwen/Qwen2-7B-Instruct --template-type qwen --datasets trivia_qa --limit 2

jackqdldd commented 3 months ago

The bbh dataset fails with the same error as above.

slin000111 commented 3 months ago

Test environment on this side: python 3.10, modelscope 1.16.0, installed from source:

git clone https://github.com/modelscope/eval-scope.git
cd eval-scope/
pip install -e .

jackqdldd commented 3 months ago

My environment is the same: modelscope 1.16.0, llmuses 0.4.0. (screenshot showing a floating point exception)

python llmuses/run.py --model qwen/Qwen2-7B-Instruct --template-type qwen --datasets arc --dataset-hub Local --dataset-args '{"arc": {"local_path": "/root/eval-scope/data/arc"}}' --dataset-dir /root/eval-scope/data/

slin000111 commented 3 months ago

The "Floating point exception" error seems related to your environment. Try running the example below to check:

from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" # the device to move the model inputs onto

# Load the model in fp16; device_map="auto" places it on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen2-7B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the chat template into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens, keeping only the newly generated ones
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
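If this snippet also crashes with a floating point exception, the problem lies in the environment (GPU driver / PyTorch build) rather than in llmuses itself.
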
jackqdldd commented 3 months ago

(screenshot of the error output)

slin000111 commented 3 months ago

Does your GPU support bf16?
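A quick check, assuming a recent PyTorch build: the following prints True only on GPUs with native bf16 support (typically Ampere or newer).

python -c "import torch; print(torch.cuda.is_bf16_supported())"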

jackqdldd commented 3 months ago

It was indeed an environment problem. Does the framework support evaluating a model that has already been deployed? For example, if a large model is deployed remotely, how do I evaluate it through its URL?

slin000111 commented 3 months ago

The test below uses a model deployed with vLLM; replace the url, model, and dataset-path with your own. You can first use curl to verify that the remotely deployed model is reachable.

llmuses perf --url 'http://127.0.0.1:8000/v1/chat/completions' --parallel 1 --model '/mnt/workspace/qwen2-7b-instruct/qwen/Qwen2-7B-Instruct' --log-every-n-query 10 --read-timeout=120 --dataset-path '/mnt/workspace/HC3-Chinese/open_qa.jsonl' -n 50 --max-prompt-length 128000 --api openai --stream --dataset openqa
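As a quick connectivity check, a curl request to the OpenAI-compatible endpoint might look like this (URL and model name taken from the perf command above; adjust to your deployment):

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/workspace/qwen2-7b-instruct/qwen/Qwen2-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
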
jackqdldd commented 3 months ago

Thanks, I tried the method above and it works. But that measures performance, right? What about verifying the model's outputs: how do I evaluate the model's capabilities with the built-in datasets or a custom dataset when the model is deployed on a remote machine?