Open RonanKMcGovern opened 9 months ago
If I may ask, are your inference results good? I am trying to run inference, but it seems the generated output ends weirdly. I use a single A100 to generate an output.
Yup, inference is good.
Btw I modified float16 in the builder.py to be bfloat16. This is needed for stability in training.
How did you train? I thought the training code isn't out yet?
If I may ask, are your inference results good? I am trying to run inference, but it seems the generated output ends weirdly. I use a single A100 to generate an output.
Check out https://github.com/haotian-liu/LLaVA/pull/1115 and see if it solves your issue.
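For anyone hitting the same generation warnings: the gist of that kind of fix (a sketch of the usual pattern, not the exact diff in the PR) is to give the tokenizer an explicit pad token and pass an attention mask to generate().

# Sketch only; assumes `tokenizer` and `model` are the objects returned by
# LLaVA's load_pretrained_model(). Not the exact code from PR #1115.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.unk_token            # or tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("Describe the image.", return_tensors="pt").to(model.device)
output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,                # silences the attention-mask warning
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=64,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))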
Many thanks @aliencaocao ! I just tried those patches.
They fixed errors with not having set a padding token or attention mask for generation.
However, the 'copying from a non-meta parameter' error persists, and my training still OOMs.
I'm just using a transformers trainer with everything loaded in bf16. BUT:
OutOfMemoryError Traceback (most recent call last)
Cell In[24], line 44
11 training_args = TrainingArguments(
12 output_dir=f"{model_name}-chess",
13 learning_rate=1e-4,
(...)
33 optim="adamw_torch",
34 )
36 trainer = Trainer(
37 model=model,
38 args=training_args,
(...)
41 # compute_loss=compute_loss, # Pass the custom compute_loss function
42 )
---> 44 trainer.train()
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1539, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1537 hf_hub_utils.enable_progress_bars()
1538 else:
-> 1539 return inner_training_loop(
1540 args=args,
1541 resume_from_checkpoint=resume_from_checkpoint,
1542 trial=trial,
1543 ignore_keys_for_eval=ignore_keys_for_eval,
1544 )
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1869, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1866 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
1868 with self.accelerator.accumulate(model):
-> 1869 tr_loss_step = self.training_step(model, inputs)
1871 if (
1872 args.logging_nan_inf_filter
1873 and not is_torch_tpu_available()
1874 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1875 ):
1876 # if loss is nan or inf simply add the average of previous logged losses
1877 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2772, in Trainer.training_step(self, model, inputs)
2769 return loss_mb.reduce_mean().detach().to(self.args.device)
2771 with self.compute_loss_context_manager():
-> 2772 loss = self.compute_loss(model, inputs)
2774 if self.args.n_gpu > 1:
2775 loss = loss.mean() # mean() to average on multi-gpu parallel training
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2795, in Trainer.compute_loss(self, model, inputs, return_outputs)
2793 else:
2794 labels = None
-> 2795 outputs = model(**inputs)
2796 # Save past state if it exists
2797 # TODO: this needs to be fixed and made cleaner later.
2798 if self.args.past_index >= 0:
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:581, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
580 def forward(*args, **kwargs):
--> 581 return model_forward(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:569, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
568 def __call__(self, *args, **kwargs):
--> 569 return convert_to_fp32(self.model_forward(*args, **kwargs))
File /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
13 @functools.wraps(func)
14 def decorate_autocast(*args, **kwargs):
15 with autocast_instance:
---> 16 return func(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/peft/peft_model.py:537, in PeftModel.forward(self, *args, **kwargs)
533 def forward(self, *args: Any, **kwargs: Any):
534 """
535 Forward pass of the model.
536 """
--> 537 return self.get_base_model()(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /workspace/LLaVA/llava/model/language_model/llava_mistral.py:91, in LlavaMistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, images, image_sizes, return_dict)
73 if inputs_embeds is None:
74 (
75 input_ids,
76 position_ids,
(...)
88 image_sizes
89 )
---> 91 return super().forward(
92 input_ids=input_ids,
93 attention_mask=attention_mask,
94 position_ids=position_ids,
95 past_key_values=past_key_values,
96 inputs_embeds=inputs_embeds,
97 labels=labels,
98 use_cache=use_cache,
99 output_attentions=output_attentions,
100 output_hidden_states=output_hidden_states,
101 return_dict=return_dict
102 )
File /usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py:1154, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1151 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1153 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1154 outputs = self.model(
1155 input_ids=input_ids,
1156 attention_mask=attention_mask,
1157 position_ids=position_ids,
1158 past_key_values=past_key_values,
1159 inputs_embeds=inputs_embeds,
1160 use_cache=use_cache,
1161 output_attentions=output_attentions,
1162 output_hidden_states=output_hidden_states,
1163 return_dict=return_dict,
1164 )
1166 hidden_states = outputs[0]
1167 logits = self.lm_head(hidden_states)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py:1039, in MistralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1029 layer_outputs = self._gradient_checkpointing_func(
1030 decoder_layer.__call__,
1031 hidden_states,
(...)
1036 use_cache,
1037 )
1038 else:
-> 1039 layer_outputs = decoder_layer(
1040 hidden_states,
1041 attention_mask=attention_mask,
1042 position_ids=position_ids,
1043 past_key_value=past_key_values,
1044 output_attentions=output_attentions,
1045 use_cache=use_cache,
1046 )
1048 hidden_states = layer_outputs[0]
1050 if use_cache:
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py:751, in MistralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, **kwargs)
735 """
736 Args:
737 hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
(...)
746 past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
747 """
749 residual = hidden_states
--> 751 hidden_states = self.input_layernorm(hidden_states)
753 # Self Attention
754 hidden_states, self_attn_weights, present_key_value = self.self_attn(
755 hidden_states=hidden_states,
756 attention_mask=attention_mask,
(...)
760 use_cache=use_cache,
761 )
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py:87, in MistralRMSNorm.forward(self, hidden_states)
85 hidden_states = hidden_states.to(torch.float32)
86 variance = hidden_states.pow(2).mean(-1, keepdim=True)
---> 87 hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
88 return self.weight * hidden_states.to(input_dtype)
OutOfMemoryError: CUDA out of memory. Tried to allocate 150.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 34.12 MiB is free. Process 3061903 has 47.49 GiB memory in use. Of the allocated memory 46.17 GiB is allocated by PyTorch, and 1023.43 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Well, I can't do much about the OOM. But I thought the repo said the training code was still to be released... so you just used the transformers Trainer and it worked?
Ok, so (perhaps obviously) the OOM error was fixed by adding more VRAM. It takes me 100 GB of VRAM to fine-tune the LLaVA 1.6 7B model in bf16. That is 2x what it takes to fine-tune the LLaVA 1.5 7B model.
The loading error remains so I'll keep this issue open:
warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
And yes, aliencaocao, the Hugging Face Trainer works (although it will be unstable with fp16).
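For reference, roughly what my Trainer setup looks like (a sketch reconstructed from the traceback above; `train_dataset` and `data_collator` are my own objects and the values are illustrative):

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=f"{model_name}-chess",
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,                      # bf16 rather than fp16 for stability
    logging_steps=1,
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,                    # the LLaVA model loaded in bf16 (with a LoRA adapter)
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()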
The whole time, this issue was a lack of VRAM.
LLaVA 1.6 takes multiple (overlapping) patches of the input image and uses them as input. This seems to be one driving reason for the high VRAM usage during inference and fine-tuning.
For 1.6 Mistral I needed 3 A6000s to load in bf16.
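For rough intuition on why (my own back-of-envelope, assuming the default 336px CLIP ViT-L/14 encoder with 576 tokens per tile and a base image plus a 2x2 grid of tiles; the exact tile count depends on the anyres grid picked for the image aspect ratio):

# Image tokens per sample, LLaVA 1.5 vs 1.6-style anyres patching (rough estimate).
tokens_per_tile = (336 // 14) ** 2                  # 576 tokens for a 336px tile with 14px patches
llava_1_5_tokens = tokens_per_tile                  # one resized image -> 576 tokens
llava_1_6_tokens = tokens_per_tile * (1 + 2 * 2)    # base image + 2x2 tiles -> 2880 tokens
print(llava_1_5_tokens, llava_1_6_tokens)

Longer image-token sequences mean larger activations (and attention cost) per sample, which is consistent with the higher VRAM use seen here.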
I loaded and ran inference on a 12 GB 3080 Ti....
EDIT: sorry if you meant training in bf16. You sounded like trying to infer
You must be running quantized, though, to fit in 12 GB?
Yeah, for training I run out of VRAM. The meta error also appears when I load on a single A6000, although inference runs fine. TBH I need to better replicate exactly what's happening.
No, I loaded in fp16 for inference, using SGLang. It's very slow though, and I think it's silently offloading some layers to CPU RAM.
Yeah ok, thanks, makes sense as the 7B alone would be ~13 GB of VRAM.
I'm curious how much CPU RAM you have? (Because it is indeed surprising to me that loading in bf16 puts some layers on the CPU when I have 48 GB of VRAM.)
I have 32 GB and it's using more than 10 GB.
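One way to check for silent CPU offload (a sketch; it relies on the hf_device_map attribute that accelerate attaches when the model is loaded with a device_map):

# `model` is the object returned by load_pretrained_model(). hf_device_map is
# only present when a device_map was used at load time (e.g. device_map="auto").
device_map = getattr(model, "hf_device_map", None)
if device_map is None:
    print("no hf_device_map; the model was loaded onto a single device")
else:
    offloaded = {name: dev for name, dev in device_map.items() if dev in ("cpu", "disk")}
    print("modules on cpu/disk:", offloaded or "none")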
Hello, may I confirm if the 'copying from a non-meta parameter' problem has been solved with larger VRAM?
Yes, much bigger VRAM, but I'm puzzled by why it takes so much.
Reopening as the large memory footprint for training (even when loaded in bf16) remains unexplained. It takes ~100 GB to fine-tune LLaVA 1.6 7B in bf16...
Do the warnings matter anyway? While it seems to output something reasonable, we don't know if it reaches the real performance, since it's an LLM!
Describe the issue
Issue/Error: Loading 1.5 models works fine, but loading 1.6 models yields the error below. Note that the 1.6 models do load (despite the error) and inference works. However, training the 1.6 model results in OOM (unlike the 1.5 models, which train fine - specifically, training 7B models in bf16 on a 48 GB A6000).
Command:
import torch

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "liuhaotian/llava-v1.6-mistral-7b"
model_name = get_model_name_from_path(model_path)

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    # load_8bit=True
    # load_4bit=True
)
Log:
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2025: UserWarning: for vision_model.embeddings.class_embedding: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass `assign=True` to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)
  warnings.warn(f'for {key}: copying from a non-meta parameter in the checkpoint to a meta '
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2025: UserWarning: for vision_model.embeddings.patch_embedding.weight: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean to pass assign=
Set device_map=auto in the vision tower's load_model(), or set the unfreeze vision model option to false in the config.
Thanks @Linziyang1999 , appreciate the response.
Btw, device_map already defaults to 'auto' in builder.py (note that I've only changed float16 to bfloat16 here):
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import warnings
import shutil
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from llava.model import *
from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs):
kwargs = {"device_map": device_map, **kwargs}
if device != "cuda":
kwargs['device_map'] = {"": device}
if load_8bit:
kwargs['load_in_8bit'] = True
elif load_4bit:
kwargs['load_in_4bit'] = True
kwargs['quantization_config'] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'
)
else:
kwargs['torch_dtype'] = torch.bfloat16
if use_flash_attn:
kwargs['attn_implementation'] = 'flash_attention_2'
if 'llava' in model_name.lower():
# Load LLaVA model
if 'lora' in model_name.lower() and model_base is None:
warnings.warn('There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.')
if 'lora' in model_name.lower() and model_base is not None:
from llava.model.language_model.llava_llama import LlavaConfig
lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
print('Loading LLaVA from base model...')
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
print('Loading additional LLaVA weights...')
if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(
repo_id=repo_id,
filename=filename,
subfolder=subfolder)
return torch.load(cache_file, map_location='cpu')
non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
if any(k.startswith('model.model.') for k in non_lora_trainables):
non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
print('Loading LoRA weights...')
model = PeftModel.from_pretrained(model, model_path)
print('Merging LoRA weights...')
model = model.merge_and_unload()
print('Model is loaded...')
elif model_base is not None:
# this may be mm projector only
print('Loading LLaVA from base model...')
if 'mpt' in model_name.lower():
if not os.path.isfile(os.path.join(model_path, 'configuration_mpt.py')):
shutil.copyfile(os.path.join(model_base, 'configuration_mpt.py'), os.path.join(model_path, 'configuration_mpt.py'))
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True)
cfg_pretrained = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = LlavaMptForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
mm_projector_weights = {k: v.to(torch.bfloat16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
else:
if 'mpt' in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
elif 'mistral' in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
**kwargs
)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = LlavaLlamaForCausalLM.from_pretrained(
model_path,
low_cpu_mem_usage=True,
**kwargs
)
else:
# Load language model
if model_base is not None:
# PEFT model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs)
print(f"Loading LoRA weights from {model_path}")
model = PeftModel.from_pretrained(model, model_path)
print(f"Merging weights")
model = model.merge_and_unload()
print('Convert to FP16...')
model.to(torch.bfloat16)
else:
use_fast = False
if 'mpt' in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
image_processor = None
if 'llava' in model_name.lower():
mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
if mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
print('vision_tower is not loaded so loading it now')
vision_tower.load_model(device_map=device_map)
vision_tower.to(device=device, dtype=torch.bfloat16)
else:
print('vision_tower is loaded')
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
else:
context_len = 2048
return tokenizer, model, image_processor, context_len
You say to update the config as an alternative. Where specifically should I do that?
The reason you encounter this warning is that the from_pretrained() method allocates the vision encoder on the meta device to keep CPU memory usage low, so you get the warning at init time, but it is not a problem at all because the parameters still load properly. To avoid the warning you can change the default load_model() method in CLIPVisionTower:

def load_model(self, device_map="auto"):  # change the default to "auto"
    if self.is_loaded:
        print('{} is already loaded, `load_model` called again, skipping.'.format(self.vision_tower_name))
        return
    # print("load model from", self.vision_tower_name)
    self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
    # input(device_map)
    # print(self.vision_tower.device.type)
    self.vision_tower.requires_grad_(False)
    self.is_loaded = True

You can also avoid it by setting unfreeze_vision_tower = false in the config of your model files (like llava-v1.6-mistral-7b/config), but be careful: with this method you will load the vision tower parameters from the original clip-vit-large-patch14-336 rather than the ViT parameters saved with the LLaVA checkpoint. The original parameters were not finetuned, so you should update them or you will receive unreliable answers. The finetuned ViT weights live in the LLaVA checkpoint under keys like:

"model.vision_tower.vision_tower.vision_model.embeddings.class_embedding": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00003-of-00004.safetensors",

You can use my method to copy them back into the CLIP checkpoint:

import torch
from safetensors.torch import load_file

model_weights = torch.load('./clip-vit-large-patch14-336/pytorch_model.bin')
model_weight_llava = load_file('./llava-v1.6-mistral-7b/model-00003-of-00004.safetensors')

for layer_name, param in model_weights.items():
    llava_key = "model.vision_tower.vision_tower." + layer_name
    if llava_key in model_weight_llava and "vision_model.encoder" in layer_name:
        if not torch.equal(model_weight_llava[llava_key], param):
            print(llava_key, " updated!")
            model_weights[layer_name] = model_weight_llava[llava_key]

torch.save(model_weights, './clip-vit-large-patch14-336/pytorch_model.bin')
I just read the from_pretrained() method in transformers. After searching the code, I find this warning is by design, so that weights are not loaded twice, so just ignore it.
And that device_map is used when loading LlavaMistralForCausalLM; this warning happens when you load CLIPVisionModel. LLaVA sets device_map=None and low_cpu_mem_usage=False when loading the pretrained CLIP model, and _fast_init defaults to True in the from_pretrained() method:
init_contexts = [no_init_weights(_enable=_fast_init)]
So the model is initialized on the meta device. The CLIP vision model saves its weights in a single pytorch_model.bin, so loading it calls _load_state_dict_into_model(), which is not designed for loading onto the meta device, so that call does nothing (hence the warning). The weights are then loaded properly in LlavaMistralForCausalLM.from_pretrained(), which calls _load_state_dict_into_meta_model(), which is designed for loading onto the meta device. So the parameters are only loaded once.
Hope this helps you, and if you have any questions feel free to contact me.
Thanks.
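A quick way to convince yourself the warning really is harmless (my own sanity check, not from the repo): after load_pretrained_model() returns, no parameter should still live on the meta device and the vision tower weights should be real, non-empty tensors.

# Sanity check that loading completed despite the meta-parameter warning.
still_meta = [n for n, p in model.named_parameters() if p.device.type == "meta"]
print("parameters still on meta:", still_meta or "none")

vt = model.get_vision_tower()
w = vt.vision_tower.vision_model.embeddings.patch_embedding.weight
print(w.device, w.dtype, float(w.abs().sum()))    # should be a real tensor, not meta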
Unfortunately I don't follow your explanation exactly, but it seems that you are saying the meta warnings are not a problem.
Separately, can you explain why VRAM requirements appear so high for LoRA training of LLaVA 1.6 versus 1.5?
Ok, you can now load any checkpoint. See the OPTIONALLY LOAD A DIFFERENT CHECKPOINT part of the script below the training.
Kindly close this issue if everything works.
Thanks @Linziyang1999 , appreciate the response. btw, device_map is already defaulting to 'auto' in builder.py (note that I've only changed from float16 to bfloat16 here):
# Copyright 2023 Haotian Liu # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import os import warnings import shutil from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig import torch from llava.model import * from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs): kwargs = {"device_map": device_map, **kwargs} if device != "cuda": kwargs['device_map'] = {"": device} if load_8bit: kwargs['load_in_8bit'] = True elif load_4bit: kwargs['load_in_4bit'] = True kwargs['quantization_config'] = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type='nf4' ) else: kwargs['torch_dtype'] = torch.bfloat16 if use_flash_attn: kwargs['attn_implementation'] = 'flash_attention_2' if 'llava' in model_name.lower(): # Load LLaVA model if 'lora' in model_name.lower() and model_base is None: warnings.warn('There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.') if 'lora' in model_name.lower() and model_base is not None: from llava.model.language_model.llava_llama import LlavaConfig lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path) tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False) print('Loading LLaVA from base model...') model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs) token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features if model.lm_head.weight.shape[0] != token_num: model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype)) print('Loading additional LLaVA weights...') if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')): non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu') else: # this is probably from HF Hub from huggingface_hub import hf_hub_download def load_from_hf(repo_id, filename, subfolder=None): cache_file = hf_hub_download( repo_id=repo_id, filename=filename, subfolder=subfolder) return torch.load(cache_file, map_location='cpu') non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin') non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()} if any(k.startswith('model.model.') for k in non_lora_trainables): non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()} model.load_state_dict(non_lora_trainables, strict=False) 
            from peft import PeftModel
            print('Loading LoRA weights...')
            model = PeftModel.from_pretrained(model, model_path)
            print('Merging LoRA weights...')
            model = model.merge_and_unload()
            print('Model is loaded...')
        elif model_base is not None:
            # this may be mm projector only
            print('Loading LLaVA from base model...')
            if 'mpt' in model_name.lower():
                if not os.path.isfile(os.path.join(model_path, 'configuration_mpt.py')):
                    shutil.copyfile(os.path.join(model_base, 'configuration_mpt.py'), os.path.join(model_path, 'configuration_mpt.py'))
                tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True)
                cfg_pretrained = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
                model = LlavaMptForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
                cfg_pretrained = AutoConfig.from_pretrained(model_path)
                model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)

            mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
            mm_projector_weights = {k: v.to(torch.bfloat16) for k, v in mm_projector_weights.items()}
            model.load_state_dict(mm_projector_weights, strict=False)
        else:
            if 'mpt' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
                model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
            elif 'mistral' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path)
                model = LlavaMistralForCausalLM.from_pretrained(
                    model_path,
                    low_cpu_mem_usage=True,
                    **kwargs
                )
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
                model = LlavaLlamaForCausalLM.from_pretrained(
                    model_path,
                    low_cpu_mem_usage=True,
                    **kwargs
                )
    else:
        # Load language model
        if model_base is not None:
            # PEFT model
            from peft import PeftModel
            tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
            model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs)
            print(f"Loading LoRA weights from {model_path}")
            model = PeftModel.from_pretrained(model, model_path)
            print(f"Merging weights")
            model = model.merge_and_unload()
            print('Convert to FP16...')
            model.to(torch.bfloat16)
        else:
            use_fast = False
            if 'mpt' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
                model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
                model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)

    image_processor = None

    if 'llava' in model_name.lower():
        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
        model.resize_token_embeddings(len(tokenizer))

        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            print('vision_tower is not loaded so loading it now')
            vision_tower.load_model(device_map=device_map)
            vision_tower.to(device=device, dtype=torch.bfloat16)
        else:
            print('vision_tower is loaded')
        image_processor = vision_tower.image_processor

    if hasattr(model.config, "max_sequence_length"):
        context_len = model.config.max_sequence_length
    else:
        context_len = 2048

    return tokenizer, model, image_processor, context_len
You say to update the `config` as an alternative. Where specifically should I do that?

The reason you get this warning is that the `from_pretrained()` method allocates the vision encoder on the meta device for low CPU memory usage, so the warning appears at init time, but it is not a problem at all because the parameters still load properly. To avoid the warning you can change the default `load_model()` method in `CLIPVisionTower`:

```
# change to auto
def load_model(self, device_map="auto"):
    if self.is_loaded:
        print('{} is already loaded, `load_model` called again, skipping.'.format(self.vision_tower_name))
        return
    # print("load model from", self.vision_tower_name)
    self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
    # input(device_map)
    # print(self.vision_tower.device.type)
    self.vision_tower.requires_grad_(False)
    self.is_loaded = True
```
You can also avoid this warning by setting `unfreeze_mm_vision_tower` to false in the config in your model folder (e.g. llava-v1.6-mistral-7b/config.json). Be careful with this method: you will then load the vision tower parameters from the original clip-vit-large-patch14-336 checkpoint rather than the ViT parameters saved inside the LLaVA checkpoint, for example:

```
"model.vision_tower.vision_tower.vision_model.embeddings.class_embedding": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00003-of-00004.safetensors",
"model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00003-of-00004.safetensors",
```

The original CLIP parameters were not finetuned, so you need some way to replace them or you will get unreliable answers. You can use my method below to copy the parameters over:

```
import torch
from safetensors.torch import load_file

model_weights = torch.load('./clip-vit-large-patch14-336/pytorch_model.bin')
model_weight_llava = load_file('./llava-v1.6-mistral-7b/model-00003-of-00004.safetensors')

for layer_name, param in model_weights.items():
    llava_key = "model.vision_tower.vision_tower." + layer_name
    if llava_key in model_weight_llava and "vision_model.encoder" in layer_name:
        if not torch.equal(model_weight_llava[llava_key], param):
            print(llava_key, " updated!")
            model_weights[layer_name] = model_weight_llava[llava_key]

torch.save(model_weights, './clip-vit-large-patch14-336/pytorch_model.bin')
```
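As a sanity check after running the script, you can reload the patched CLIP checkpoint and confirm that an encoder layer now matches the weights stored in the LLaVA shard. A minimal sketch, assuming the same local paths used above:

```python
import torch
from safetensors.torch import load_file

# Reload the patched CLIP checkpoint and the LLaVA shard (same paths as above).
clip_weights = torch.load('./clip-vit-large-patch14-336/pytorch_model.bin', map_location='cpu')
llava_weights = load_file('./llava-v1.6-mistral-7b/model-00003-of-00004.safetensors')

# Any encoder key from the index excerpt above works here.
key = 'vision_model.encoder.layers.0.layer_norm1.weight'
assert torch.equal(clip_weights[key], llava_weights['model.vision_tower.vision_tower.' + key])
print('patched CLIP weights match the vision tower stored in the LLaVA checkpoint')
```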
Thank you so much, you saved my day! @Linziyang1999
@Linziyang1999 Hi,
Thanks @Linziyang1999, appreciate the response. Btw, `device_map` already defaults to 'auto' in builder.py (note that I've only changed float16 to bfloat16 here):
# Copyright 2023 Haotian Liu
#
#    Licensed under the Apache License, Version 2.0 (the "License");
#    you may not use this file except in compliance with the License.
#    You may obtain a copy of the License at
#
#        http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS,
#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#    See the License for the specific language governing permissions and
#    limitations under the License.


import os
import warnings
import shutil

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from llava.model import *
from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN


def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", use_flash_attn=False, **kwargs):
    kwargs = {"device_map": device_map, **kwargs}

    if device != "cuda":
        kwargs['device_map'] = {"": device}

    if load_8bit:
        kwargs['load_in_8bit'] = True
    elif load_4bit:
        kwargs['load_in_4bit'] = True
        kwargs['quantization_config'] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4'
        )
    else:
        kwargs['torch_dtype'] = torch.bfloat16

    if use_flash_attn:
        kwargs['attn_implementation'] = 'flash_attention_2'

    if 'llava' in model_name.lower():
        # Load LLaVA model
        if 'lora' in model_name.lower() and model_base is None:
            warnings.warn('There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged.')
        if 'lora' in model_name.lower() and model_base is not None:
            from llava.model.language_model.llava_llama import LlavaConfig
            lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
            tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
            print('Loading LLaVA from base model...')
            model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
            token_num, tokem_dim = model.lm_head.out_features, model.lm_head.in_features
            if model.lm_head.weight.shape[0] != token_num:
                model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))
                model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, tokem_dim, device=model.device, dtype=model.dtype))

            print('Loading additional LLaVA weights...')
            if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
                non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
            else:
                # this is probably from HF Hub
                from huggingface_hub import hf_hub_download
                def load_from_hf(repo_id, filename, subfolder=None):
                    cache_file = hf_hub_download(
                        repo_id=repo_id,
                        filename=filename,
                        subfolder=subfolder)
                    return torch.load(cache_file, map_location='cpu')
                non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
            non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
            if any(k.startswith('model.model.') for k in non_lora_trainables):
                non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
            model.load_state_dict(non_lora_trainables, strict=False)

            from peft import PeftModel
            print('Loading LoRA weights...')
            model = PeftModel.from_pretrained(model, model_path)
            print('Merging LoRA weights...')
            model = model.merge_and_unload()
            print('Model is loaded...')
        elif model_base is not None:
            # this may be mm projector only
            print('Loading LLaVA from base model...')
            if 'mpt' in model_name.lower():
                if not os.path.isfile(os.path.join(model_path, 'configuration_mpt.py')):
                    shutil.copyfile(os.path.join(model_base, 'configuration_mpt.py'), os.path.join(model_path, 'configuration_mpt.py'))
                tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=True)
                cfg_pretrained = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
                model = LlavaMptForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
                cfg_pretrained = AutoConfig.from_pretrained(model_path)
                model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)

            mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
            mm_projector_weights = {k: v.to(torch.bfloat16) for k, v in mm_projector_weights.items()}
            model.load_state_dict(mm_projector_weights, strict=False)
        else:
            if 'mpt' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
                model = LlavaMptForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
            elif 'mistral' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path)
                model = LlavaMistralForCausalLM.from_pretrained(
                    model_path,
                    low_cpu_mem_usage=True,
                    **kwargs
                )
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
                model = LlavaLlamaForCausalLM.from_pretrained(
                    model_path,
                    low_cpu_mem_usage=True,
                    **kwargs
                )
    else:
        # Load language model
        if model_base is not None:
            # PEFT model
            from peft import PeftModel
            tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
            model = AutoModelForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, **kwargs)
            print(f"Loading LoRA weights from {model_path}")
            model = PeftModel.from_pretrained(model, model_path)
            print(f"Merging weights")
            model = model.merge_and_unload()
            print('Convert to FP16...')
            model.to(torch.bfloat16)
        else:
            use_fast = False
            if 'mpt' in model_name.lower():
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
                model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
            else:
                tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
                model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)

    image_processor = None

    if 'llava' in model_name.lower():
        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
        model.resize_token_embeddings(len(tokenizer))

        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            print('vision_tower is not loaded so loading it now')
            vision_tower.load_model(device_map=device_map)
            vision_tower.to(device=device, dtype=torch.bfloat16)
        else:
            print('vision_tower is loaded')
        image_processor = vision_tower.image_processor

    if hasattr(model.config, "max_sequence_length"):
        context_len = model.config.max_sequence_length
    else:
        context_len = 2048

    return tokenizer, model, image_processor, context_len
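For reference, a minimal way to exercise this builder (the checkpoint name is only an example; `get_model_name_from_path` is the standard LLaVA helper):

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

# Example checkpoint; any local or Hub LLaVA path works the same way.
model_path = "liuhaotian/llava-v1.6-mistral-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,                                  # no base needed for a merged checkpoint
    model_name=get_model_name_from_path(model_path),  # e.g. "llava-v1.6-mistral-7b"
    device_map="auto",
)
```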
You say to update the `config` as an alternative. Where specifically should I do that?

That `device_map` is only used when loading `LlavaMistralForCausalLM`. The warning happens when the `CLIPVisionModel` is loaded: LLaVA passes `device_map=None` and `low_cpu_mem_usage=False` when loading the pretrained CLIP model, and `_fast_init` defaults to True in the `from_pretrained()` method:
init_contexts = [no_init_weights(_enable=_fast_init)]
So the model is initialized on the meta device, and the CLIP vision model stores its weights in a single pytorch_model.bin, so loading it goes through `_load_state_dict_into_model()`, which is not designed for loading onto the meta device and therefore does nothing. The weights are then loaded properly in `LlavaMistralForCausalLM.from_pretrained()`, which calls `_load_state_dict_into_meta_model()`, the path designed for meta-device loading. So the parameters are only loaded once. Hope this helps, and if you have any questions feel free to contact me.
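In other words, the same idea as the `load_model()` patch earlier in the thread: give the CLIP load an explicit device placement so it does not go through the plain meta-device init path. A minimal standalone sketch, assuming the usual CLIP tower checkpoint and that suppressing the warning is the only goal:

```python
import torch
from transformers import CLIPVisionModel

# Loading the vision tower with an explicit device_map materializes the weights
# directly (the low-CPU-memory path), so the "copying from a non-meta parameter"
# warning from the default init does not appear.
vision_tower = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-large-patch14-336",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
print(next(vision_tower.parameters()).device)  # a real device, not 'meta'
```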
Hi, I have encountered the same warning. I am using a LoRA-finetuned model for eval. I set device_map=auto and torch_dtype to bf16, but it did not work. May I ask if you have any ideas for solving this problem? I loaded clip-vit-large-patch14-336 from a local download. Thank you for your reply.
Describe the issue
Issue/Error: Loading 1.5 models works fine, but loading 1.6 models yields the error below. Note that the 1.6 models do load (despite the error) and inference works. However, training the 1.6 model results in OOM, unlike the 1.5 models, which train fine (specifically, training 7B models in bf16 on a 48GB A6000).
Command:
Log: