intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Saving of low-bit models for later loading? #10729

Open ElliottDyson opened 3 months ago

ElliottDyson commented 3 months ago

Dear IPEX team,

I was wondering whether there is a way to save an HF/PyTorch model in its optimised and quantised state so it can be loaded directly later. I noticed a method in ipex_llm.optimize, but it seems to belong to a deprecated model-loading path, as it mentions a missing bigdl-llm config file. The reason for requesting this feature is the slow loading caused by having to optimise and quantise the model on every load; this is exacerbated with large models, which end up spilling into paging space because of the RAM used during the conversion. If this were a one-off cost it would not be so bad, but at the moment it happens every time. It would also enable sharing these IPEX-optimised models with other users of Intel Arc GPUs.

Many thanks.

jason-dai commented 3 months ago

See the Save and load sub-section under the Code Examples section.
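
For reference, the pattern there is roughly as follows (a minimal sketch; model_path and saved_dir are placeholder paths, and the exact example in the docs may differ slightly):

from ipex_llm.transformers import AutoModelForCausalLM

# First run: load the original model, convert it to 4-bit on the fly, and save the converted weights
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model.save_low_bit(saved_dir)

# Later runs: load the already-converted low-bit model directly, skipping the conversion
model = AutoModelForCausalLM.load_low_bit(saved_dir, trust_remote_code=True)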

ElliottDyson commented 3 months ago

My apologies. Thank you!

ElliottDyson commented 3 months ago

I seem to be having issues with the save function for the PyTorch model and was hoping you might know why; it did correctly save a Hugging Face safetensors model, though.

2024-04-10 14:33:21,912 - INFO - intel_extension_for_pytorch auto imported
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [18:09<00:00, 363.25s/it]
2024-04-10 14:58:28,819 - INFO - Converting the current model to sym_int4 format......
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\torch\serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\torch\serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:593] . PytorchStreamWriter failed writing file data/0: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\ellio\miniconda3\envs\llm\Scripts\uvicorn.exe\__main__.py", line 7, in <module>
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\main.py", line 409, in main
    run(
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\main.py", line 575, in run
    server.run()
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\asyncio\runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\asyncio\runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\asyncio\base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\server.py", line 69, in serve
    await self._serve(sockets)
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\server.py", line 76, in _serve
    config.load()
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\config.py", line 433, in load
    self.loaded_app = import_from_string(self.app)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\uvicorn\importer.py", line 19, in import_from_string
    module = importlib.import_module(module_str)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "D:\LLMs\API.py", line 74, in <module>
    function_model.save_low_bit("FunctionCallingModel/low_bit")
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\ipex_llm\transformers\model.py", line 71, in save_low_bit
    self.save_pretrained(*args, **kwargs)
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\transformers\modeling_utils.py", line 2114, in save_pretrained
    save_function(shard, os.path.join(save_directory, shard_file))
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\torch\serialization.py", line 618, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "C:\Users\ellio\miniconda3\envs\llm\Lib\site-packages\torch\serialization.py", line 466, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 45056 vs 44950

Oscilloscope98 commented 3 months ago

Would you mind providing more details regarding your test code, environment, etc., so we could reproduce this issue? :)

ElliottDyson commented 3 months ago

Would you mind providing more details regarding your test code, environment, etc., so we could reproduce this issue? :)

Thank you, and sorry for the delay; we must be operating in fairly different time zones. I'm running the latest version of Windows 11 with the latest drivers and oneAPI installation, and using transformers==4.34.0.

This is the relevant section of code that I am using. roleplay_model saves just fine; as you can see from the error message, it is function_model (the PyTorch one) that fails:

from fastapi import FastAPI, Request, UploadFile, File, Body
import torch
from ipex_llm.transformers import AutoModelForCausalLM, AutoModelForSpeechSeq2Seq
#from transformers import AutoTokenizer, WhisperProcessor, TextGenerationPipeline

from ipex_llm import optimize_model
from ipex_llm.optimize import low_memory_init, load_low_bit
from transformers import AutoTokenizer, WhisperProcessor

from pydantic import BaseModel

app = FastAPI()

do_save_low_bit = True

roleplay_model_path = "text-generation-webui/models/Nexusflow_Starling-LM-7B-beta"
function_model_path = "FunctionCallingModel"
whisper_model_path = "whisper-med-en"

if do_save_low_bit:
    # Load Roleplay LLM model to save it (transformers)
    roleplay_model = AutoModelForCausalLM.from_pretrained(roleplay_model_path, load_in_4bit=True, trust_remote_code=True)
    roleplay_model.save_low_bit(roleplay_model_path + "/low_bit")

    # Load Function calling LLM model to save it (pytorch)
    function_model = AutoModelForCausalLM.from_pretrained(function_model_path, trust_remote_code=True)
    function_model = optimize_model(function_model, low_bit='sym_int4') 
    function_model.save_low_bit(function_model_path + "/low_bit")

    # Load Whisper model to save it (transformers)
    whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(whisper_model_path, load_in_4bit=True, trust_remote_code=True)
    whisper_model.save_low_bit(whisper_model_path + "/low_bit")
else:
    # Load Roleplay LLM model (transformers)
    roleplay_model = AutoModelForCausalLM.load_low_bit(roleplay_model_path + "/low_bit", trust_remote_code=True)
    roleplay_tokenizer = AutoTokenizer.from_pretrained(roleplay_model_path, trust_remote_code=True)

    # Load Function calling LLM model (pytorch)
    with low_memory_init():
        function_model = AutoModelForCausalLM.from_pretrained(function_model_path + "/low_bit", torch_dtype="auto", trust_remote_code=True)
    function_model = load_low_bit(function_model, function_model_path + "/low_bit")
    function_tokenizer = AutoTokenizer.from_pretrained(function_model_path, trust_remote_code=True)

    # Load Whisper model (transformers)
    whisper_model = AutoModelForSpeechSeq2Seq.load_low_bit(whisper_model_path + "/low_bit", trust_remote_code=True)
    whisper_model.config.forced_decoder_ids = None
    whisper_processor = WhisperProcessor.from_pretrained(whisper_model_path)
    forced_decoder_ids = whisper_processor.get_decoder_prompt_ids(language='english', task="transcribe")

    # Move models to GPU
    roleplay_model = roleplay_model.to('xpu')
    function_model = function_model.to('xpu')
    whisper_model = whisper_model.to('xpu')

    # Complete an inference on each model to initialize the model
    roleplay_model.generate(roleplay_model.dummy_inputs, max_new_tokens=1)
    function_model.generate(function_model.dummy_inputs, max_new_tokens=1)
    whisper_model.generate(whisper_model.dummy_inputs, max_new_tokens=1)

Oscilloscope98 commented 3 months ago

Hi @ElliottDyson,

Could you first verify that there is sufficient space available on your disk to save the model, and check whether the model file is being corrupted by another program while it is saved?

If those steps don't resolve the issue, would you mind providing us with your FunctionCallingModel (e.g. the Hugging Face repo id of this model) and running the environment-checking scripts to share the results? This will help us better understand and reproduce the exact error.
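
For example, a quick way to rule out the disk-space question before calling save_low_bit (a minimal sketch; the directory is a placeholder for wherever the low-bit model is being written):

import shutil

# Report free space on the drive that will hold the saved low-bit weights
total, used, free = shutil.disk_usage("D:/LLMs")  # placeholder: the target save directory
print(f"Free space: {free / 1024**3:.1f} GiB")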

P.S. If FunctionCallingModel can be loaded with ipex_llm.transformers.AutoModelForCausalLM, it could also be loaded and saved with:

function_model = AutoModelForCausalLM.from_pretrained(function_model_path, load_in_4bit=True, trust_remote_code=True)
function_model.save_low_bit(function_model_path + "/low_bit")
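
Once saved this way, it can then be reloaded directly, mirroring the load_low_bit branch already in your script:

function_model = AutoModelForCausalLM.load_low_bit(function_model_path + "/low_bit", trust_remote_code=True)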