intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Flaws in benchmark scripts #9162

Open Ariadne330 opened 1 year ago

Ariadne330 commented 1 year ago

Regarding the use_cache parameter: according to https://huggingface.co/docs/transformers/main_classes/text_generation, its default value is True, which explains the minor difference in inference time.
By the way, in the benchmark test scripts, run_transformer_int4_gpu and run_optimize_model_gpu explicitly set use_cache=True (which leads to the error report), while run_transformer_int4 and run_optimize_model ignore the parameter. This inconsistency should be fixed (see the sketch below).
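
A minimal sketch of what a consistent timing path could look like; this is not the actual all-in-one benchmark code, and the model id, prompt, and helper name are placeholders. The point is simply that use_cache is forwarded to generate() the same way on every code path (and that transformers already defaults it to True).

```python
# Hypothetical timing helper; not the benchmark script's real API.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def timed_generate(model, tokenizer, prompt, use_cache=True, max_new_tokens=32):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.inference_mode():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=use_cache,  # transformers defaults this to True anyway
        )
    elapsed = time.perf_counter() - start
    return tokenizer.decode(out[0], skip_special_tokens=True), elapsed

if __name__ == "__main__":
    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    _, t = timed_generate(model, tokenizer, "Once upon a time", use_cache=True)
    print(f"generation took {t:.2f}s with use_cache=True")
```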

Ricky-Ting commented 1 year ago
  1. In run_optimize_model and run_optimize_model_gpu, load_in_4bit should be removed.
  2. Support should be added for T5 models, which use AutoModelForSeq2SeqLM and modules_to_not_convert (see the sketch after this list).
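
A hedged sketch of the optimize_model loading path the comment describes: the full-precision model is loaded first (so no load_in_4bit on this path), then converted. The import path, the low_bit keyword, and the modules_to_not_convert value shown here are assumptions based on the library's documented API and this comment, and the model id is a placeholder.

```python
# Sketch only: loading a T5-style model for the optimize_model path.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from ipex_llm import optimize_model  # formerly: from bigdl.llm import optimize_model

model_id = "google/flan-t5-large"  # placeholder seq2seq (T5) model

# Load full precision first; low-bit conversion is done by optimize_model,
# so load_in_4bit does not belong on this path.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model = optimize_model(
    model,
    low_bit="sym_int4",
    modules_to_not_convert=["lm_head"],  # example value only
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```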
Oscilloscope98 commented 1 year ago

trust_remote_code is also missing from some of the other options.
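
For illustration, a minimal sketch of passing trust_remote_code=True on both the model and tokenizer loading calls, which models such as ChatGLM and Qwen (shipping custom modeling code) require. The ipex_llm.transformers import path reflects the current package name and is an assumption for the scripts discussed here; the model id is a placeholder.

```python
# Sketch only: ensure trust_remote_code is set on every loading path.
from ipex_llm.transformers import AutoModelForCausalLM  # formerly bigdl.llm.transformers
from transformers import AutoTokenizer

model_id = "THUDM/chatglm2-6b"  # placeholder model that requires remote code
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```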