unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

How to save a llama3.2 model? #1112

Closed fzyzcjy closed 1 month ago

fzyzcjy commented 1 month ago

Error

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 objects.model_pretrained.inner.save_pretrained('/tmp/b')
      2 get_ipython().system('ls -alh /tmp/b')

File /opt/conda/lib/python3.11/site-packages/transformers/modeling_utils.py:2749, in PreTrainedModel.save_pretrained(self, save_directory, is_main_process, state_dict, save_function, push_to_hub, max_shard_size, safe_serialization, variant, token, save_peft_format, **kwargs)
   2744     gc.collect()
   2746 if safe_serialization:
   2747     # At some point we will need to deal better with save_function (used for TPU and other distributed
   2748     # joyfulness), but for now this enough.
-> 2749     safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
   2750 else:
   2751     save_function(shard, os.path.join(save_directory, shard_file))

File /opt/conda/lib/python3.11/site-packages/safetensors/torch.py:284, in save_file(tensors, filename, metadata)
    253 def save_file(
    254     tensors: Dict[str, torch.Tensor],
    255     filename: Union[str, os.PathLike],
    256     metadata: Optional[Dict[str, str]] = None,
    257 ):
    258     """
    259     Saves a dictionary of tensors into raw bytes in safetensors format.
    (...)
    283     """
--> 284     serialize_file(_flatten(tensors), filename, metadata=metadata)

File /opt/conda/lib/python3.11/site-packages/safetensors/torch.py:480, in _flatten(tensors)
    477         failing.append(names)
    479 if failing:
--> 480     raise RuntimeError(
    481         f"""
    482         Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: {failing}.
    483         A potential way to correctly save your model is to use `save_model`.
    484         More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
    485         """
    486     )
    488 return {
    489     k: {
    490         "dtype": str(v.dtype).split(".")[-1],
    (...)
    494     for k, v in tensors.items()
    495 }

RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'lm_head.weight', 'model.embed_tokens.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
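The traceback's own suggestion is `safetensors.torch.save_model`, which de-duplicates shared tensors (here `lm_head.weight` aliasing `model.embed_tokens.weight`) before writing. A minimal sketch of that call, where `model` stands in for the underlying Hugging Face model and the output path is a placeholder; note it only writes the weights file, not the config or tokenizer:

```python
# Sketch only: save_model removes duplicate storage for tied tensors before
# serializing, which is what the RuntimeError above recommends.
from safetensors.torch import save_model

save_model(model, "/tmp/b/model.safetensors")  # placeholder path
```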

If I use the workaround from #278:

```
model.save_pretrained('...', safe_serialization=False)
```

the model can be saved, but the result cannot be loaded by vLLM:

```
INFO 10-07 03:08:28 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 10-07 03:08:28 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/host_home/mlflow_artifact_root/701/1f8f48bc670c4d6b9eb237d2f4e97176/artifacts/output', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 10-07 03:08:28 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/host_home/mlflow_artifact_root/701/1f8f48bc670c4d6b9eb237d2f4e97176/artifacts/output', speculative_config=None, tokenizer='/host_home/mlflow_artifact_root/701/1f8f48bc670c4d6b9eb237d2f4e97176/artifacts/output', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/host_home/mlflow_artifact_root/701/1f8f48bc670c4d6b9eb237d2f4e97176/artifacts/output, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 10-07 03:08:28 model_runner.py:680] Starting to load model /host_home/mlflow_artifact_root/701/1f8f48bc670c4d6b9eb237d2f4e97176/artifacts/output...
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 283, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 508, in load_weights
[rank0]:     param = params_dict[name]
[rank0]: KeyError: 'lm_head.weight'
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00
```
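Since Llama 3.2 ties `lm_head.weight` to `model.embed_tokens.weight`, one possible workaround (a sketch only, not the flow Unsloth recommends; `save_without_tied_lm_head` is a hypothetical helper) is to drop the duplicated tensor before writing safetensors, letting loaders that honor `config.tie_word_embeddings` rebuild the head from the embeddings:

```python
# Hypothetical helper: write a safetensors checkpoint without the tensor that
# aliases the embedding matrix. Loaders that re-tie weights when
# config.tie_word_embeddings is True can recreate lm_head from the embeddings.
import os
from safetensors.torch import save_file

def save_without_tied_lm_head(model, save_directory):
    os.makedirs(save_directory, exist_ok=True)
    state_dict = {
        k: v.detach().cpu().contiguous()  # safetensors needs contiguous tensors
        for k, v in model.state_dict().items()
        if k != "lm_head.weight"          # tied to model.embed_tokens.weight
    }
    save_file(state_dict, os.path.join(save_directory, "model.safetensors"),
              metadata={"format": "pt"})
```

The `config.json` and tokenizer files would still need to come from `save_pretrained` for vLLM to load the directory.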


Update

vLLM now works after upgrading it. However, I would still appreciate being able to save in the safetensors format!

danielhanchen commented 1 month ago

If this is on Colab / Kaggle, it's mainly because saving to safetensors is slower :(

See https://github.com/unslothai/unsloth/wiki#saving-to-safetensors-not-bin-format-in-colab - i.e. set `safe_serialization = None`.
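For reference, a minimal usage sketch of that suggestion (the output directory name is a placeholder):

```python
# Skip safetensors serialization as suggested above; this writes .bin shards
# that torch.save produces instead of .safetensors files.
model.save_pretrained("outputs", safe_serialization = None)
tokenizer.save_pretrained("outputs")
```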

fzyzcjy commented 1 month ago

I see, thank you!