huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

How to save / form the config.json after fine-tuning - Flan T5 11b #93

Closed sujithjoseph closed 1 year ago

sujithjoseph commented 1 year ago

After fine-tuning a flan t5 11b model on custom data, I was saving the checkpoint via accelerate like this

        accelerator.wait_for_everyone()
        accelerator.save(
            get_peft_model_state_dict(model, state_dict=accelerator.get_state_dict(model)), checkpoint_name
        )
        accelerator.wait_for_everyone() 

It didn't create the config.json needed to load the model. The checkpoint itself was created (cdcFT5_lora.pt, ~19 MB).

I am trying to create it manually for inference, using the parameters I used for training and looking at some sample LoRA model configs. Should target_modules be

"target_modules": [ "q", "v" ],

OR

"target_modules": [ "query_key_value" ],

{
  "base_model_name_or_path": "./cdcFT5_lora.pt",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q",
    "v"
  ],
  "task_type": "SEQ_2_SEQ_LM"
}

What values should I give for "enable_lora": [ true, false, true ] and "fan_in_fan_out": true?

For inference, should enable_lora be true and fan_in_fan_out be false?

How do I save the model with config.json directly as well?

Is it via

peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
accelerator.save_pretrained(peft_model_id)

I see that model.save_pretrained() exists; I'm not sure whether accelerator.save_pretrained(peft_model_id) works as well.

Is there any way to load the checkpoint and create the config file without re-training?

sujithjoseph commented 1 year ago

I was able to re-create the config file by running a short training on a smaller data set and then saving the unwrapped model:

finalmodel = accelerator.unwrap_model(model)
finalmodel.save_pretrained(peft_model_id)
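
For reference, a fuller version of this save pattern (the same one used later in this thread for interim checkpoints) looks roughly like the sketch below; peft_model_id is just a placeholder for the output directory:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    peft_model_id,                                  # output directory (placeholder name)
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
# save_pretrained on a PeftModel should write both the adapter weights and the
# adapter config json that PeftConfig.from_pretrained expects.
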
sujithjoseph commented 1 year ago

How can I do inference easily using Hugging Face pipelines, like the snippet below, with a PeftModelForSeq2SeqLM model?

from transformers import pipeline

summarizer = pipeline("summarization", "cdcFT5lra", torch_dtype=torch.bfloat16)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

OR

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
    input_ids, 
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

The second approach doesn't work and gives an error:

TypeError: generate() takes 1 positional argument but 2 were given

The PEFT examples use datasets as input for inference. Is that the only way?

pacman100 commented 1 year ago

Hello @sujithjoseph, for PEFT generate methods one has to pass keyword arguments. Could you try the change below and let us know if it resolves the issue? We will add this point to the caveats.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
-   input_ids, 
+   input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)
pacman100 commented 1 year ago

Also, you can use it with Pipelines via the logic below. A warning will be displayed saying the model might be unsupported; it can be ignored, because PeftModel isn't a subclass of models such as T5:

from transformers import SummarizationPipeline

summarizer = SummarizationPipeline(model= model, tokenizer= tokenizer)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

Let us know if the above snippet helps with using pipelines.

sujithjoseph commented 1 year ago

Thanks @pacman100, really appreciate it! I had a follow-up question. I was trying to load the model in int-8:


max_memory={0: "30GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu":"60GB"}
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory, load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

I got a runtime error: RuntimeError: expected scalar type Half but found Float

By default, does it load in bfloat16 or float16 if the model was trained in bfloat16?

sujithjoseph commented 1 year ago

The fine-tuned flan-t5-xxl takes around 10-20 seconds on a single 40 GB A100 GPU to answer a prompt. Is there anything that can be done to make it faster without using a smaller flan-t5 model?

mayank31398 commented 1 year ago

Try running in bf16 instead of fp32. Also, you can look at ONNX/TensorRT
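
A minimal sketch of the bf16 loading path suggested here, assuming peft_model_id points at the saved adapter as earlier in this thread:

import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.to(torch.bfloat16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)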

pacman100 commented 1 year ago

I had a follow-up question. I was trying to load the model in int-8:

To load a model trained using Accelerate + DeepSpeed ZeRO-3, you can do the following. Below is an example for a 3B model:

+ from peft import prepare_model_for_training
  peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
  config = PeftConfig.from_pretrained(peft_model_id)
  model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, 
             load_in_8bit=True, 
              device_map={'':0})
+ model = prepare_model_for_training(model)
  model = PeftModel.from_pretrained(model, peft_model_id)
  tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Then run generate as usual:

%%time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(dataset["test"][i]["Tweet text"])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
# ['complaint']

I ran the snippet below in a Jupyter cell for the following 3 settings:

from time import time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

  1. For fp32, load directly without using device_map if you have enough GPU memory: model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
  2. For bf16, post loading the PeftModel, do model.to(torch.bfloat16)

precision | inference wall time (ms)
FP32      | 96
BF16      | 105
INT8      | 370

@mayank31398, BF16 taking more time than FP32 is peculiar; usually with FP16 models latency is reduced by about half, but here it is increasing. To make sure this isn't related to PEFT, I loaded just the pretrained LLM and can still see the same behaviour, with BF16 latency higher than FP32.

pacman100 commented 1 year ago

@sujithjoseph, device_map and load_in_8bit are meant for low-resource inference, i.e. when your GPU's VRAM can't fit the entire model; device_map offloads the model to CPU or spreads it across smaller GPUs, while load_in_8bit aims to fit such large models on a given GPU by keeping the weights in int8 precision.

For very low latencies, as @mayank31398 suggested, you would have to convert the model to ONNX/TensorRT; alternatively use flash attention, fused kernels ...

sujithjoseph commented 1 year ago

Thanks a lot @pacman100 @mayank31398! This has been really insightful. I didn't know that converting the model to TensorRT and serving it via the TRT inference server would be faster than PEFT + DeepSpeed ZeRO-3 for inference.

sujithjoseph commented 1 year ago

I also see quality issues with the fine-tuned flan-t5-xxl (trained on 500K records), unlike the original model; it is hallucinating a lot. I had used a batch size of 1, as I couldn't fit training on 8x 40 GB A100s with a batch size of 2 (it would run for a couple of hours and then go OOM). Here are the train/eval ppl/loss:

epoch: 0  train_ppl: 133.7952117919922  train_epoch_loss: 4.896310329437256
eval_ppl: 1.5221441984176636  eval_epoch_loss: 0.4201200008392334

def generate_custom(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
     input_ids=input_ids, 
    min_length=256,
    max_new_tokens=1024,
    length_penalty=1.4,
    no_repeat_ngram_size=2,
    top_k=150,
    top_p=0.92,
    repetition_penalty=2.1,
    #num_beams=4,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)
mayank31398 commented 1 year ago

8x 40G A100s should be enough for PEFT training of FLAN. Can you tell me what backend you are using? Are you not using DeepSpeed?

sujithjoseph commented 1 year ago

Yes, DeepSpeed ZeRO-3. It worked fine with a batch size of 1, not 2. I am concerned that the lower batch size is impacting model quality. I had 500K records as the training set. Here is my config (DeepSpeed / Accelerate):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

reference - https://github.com/microsoft/DeepSpeed/issues/2820

mayank31398 commented 1 year ago

I only see 4 processes in the YAML ^^. You can always enable CPU offloading.

sujithjoseph commented 1 year ago

@mayank31398 I had started with 4 and expanded to 8; my final config has num_processes as 8. Doesn't this enable CPU offloading?

  offload_optimizer_device: cpu
  offload_param_device: cpu
sujithjoseph commented 1 year ago

I also had changed this in the final config - dynamo_backend: 'INDUCTOR'

sujithjoseph commented 1 year ago

If I shard the xxl base model like this

model.save_pretrained("sharded", max_shard_size="2000MB")

will it help with fine-tuning at a larger batch size, or should I load it in int-8 and fine-tune with whatever larger batch size fits in memory? I'm not sure which one will result in a higher-quality model.
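
For reference, the int-8 route being weighed here typically looks roughly like the sketch below (based on the PEFT int8 examples; the base model name and LoRA hyperparameters are carried over from earlier in this thread and are assumptions):

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl", load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)  # freezes base weights, casts norms to fp32, enables input grads
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False,
    r=8, lora_alpha=32, lora_dropout=0.1, target_modules=["q", "v"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()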

sujithjoseph commented 1 year ago

Since I have the CUDA 11.6 driver installed (Vertex AI), I was using torch 1.12.1+cu116. During installation, I see this:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.2.0.dev0 requires torch>=1.13.0, but you have torch 1.12.1+cu116 which is incompatible.

Does peft really need torch 1.13.0? So far, I haven't seen any issues using 1.12.1+cu116 with peft.

sujithjoseph commented 1 year ago

@pacman100, I am not able to import prepare_model_for_training from main. I did pip install -U git+https://github.com/huggingface/peft.git. Should I install this branch - https://github.com/huggingface/peft/tree/younesbelkada-flan-t5-xl ?

ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/opt/conda/lib/python3.7/site-packages/peft/__init__.py). I see it in https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py and in https://github.com/huggingface/peft/blob/main/src/peft/__init__.py as well. I probably need to uninstall and install again.

sujithjoseph commented 1 year ago

pip install --upgrade -e git+https://github.com/huggingface/peft.git#egg=peft
pip install --upgrade git+https://github.com/huggingface/peft.git

This helped to fix it.

sujithjoseph commented 1 year ago
from time import time
model.eval()
inputs = tokenizer(f'Explain Artificial Intelligence ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

This gives the error below: AttributeError: 'NoneType' object has no attribute 'device'

─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│    8 │   with torch.no_grad():                                                                   │
│    9 │   │   #with torch.cuda.amp.autocast():                                                    │
│   10 │   │   start = time()                                                                      │
│ ❱ 11 │   │   outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_token    │
│   12 │   │   times.append((time()-start)*1000)                                                   │
│   13 print(outputs)                                                                              │
│   14 print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/peft_model.py:708 in generate                        │
│                                                                                                  │
│   705 │                                                                                          │
│   706 │   def generate(self, **kwargs):                                                          │
│   707 │   │   if not isinstance(self.peft_config, PromptLearningConfig):                         │
│ ❱ 708 │   │   │   return self.base_model.generate(**kwargs)                                      │
│   709 │   │   else:                                                                              │
│   710 │   │   │   if "input_ids" not in kwargs:                                                  │
│   711 │   │   │   │   raise ValueError("input_ids must be provided for Peft model generation")   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context        │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1248 in generate         │
│                                                                                                  │
│   1245 │   │   │   # if model is encoder decoder encoder_outputs are created                     │
│   1246 │   │   │   # and added to `model_kwargs`                                                 │
│   1247 │   │   │   model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(           │
│ ❱ 1248 │   │   │   │   inputs_tensor, model_kwargs, model_input_name                             │
│   1249 │   │   │   )                                                                             │
│   1250 │   │                                                                                     │
│   1251 │   │   # 5. Prepare `input_ids` which will be used for auto-regressive generation        │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:609 in                   │
│ _prepare_encoder_decoder_kwargs_for_generation                                                   │
│                                                                                                  │
│    606 │   │   model_input_name = model_input_name if model_input_name is not None else self.ma  │
│    607 │   │   encoder_kwargs["return_dict"] = True                                              │
│    608 │   │   encoder_kwargs[model_input_name] = inputs_tensor                                  │
│ ❱  609 │   │   model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)          │
│    610 │   │                                                                                     │
│    611 │   │   return model_kwargs                                                               │
│    612                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1075 in forward     │
│                                                                                                  │
│   1072 │   │   │   │   │   cross_attn_layer_head_mask=cross_attn_layer_head_mask,                │
│   1073 │   │   │   │   │   past_key_value=past_key_value,                                        │
│   1074 │   │   │   │   │   use_cache=use_cache,                                                  │
│ ❱ 1075 │   │   │   │   │   output_attentions=output_attentions,                                  │
│   1076 │   │   │   │   )                                                                         │
│   1077 │   │   │                                                                                 │
│   1078 │   │   │   # layer_outputs is a tuple with:                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:692 in forward      │
│                                                                                                  │
│    689 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    690 │   │   │   past_key_value=self_attn_past_key_value,                                      │
│    691 │   │   │   use_cache=use_cache,                                                          │
│ ❱  692 │   │   │   output_attentions=output_attentions,                                          │
│    693 │   │   )                                                                                 │
│    694 │   │   hidden_states, present_key_value_state = self_attention_outputs[:2]               │
│    695 │   │   attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs an  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:599 in forward      │
│                                                                                                  │
│    596 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    597 │   │   │   past_key_value=past_key_value,                                                │
│    598 │   │   │   use_cache=use_cache,                                                          │
│ ❱  599 │   │   │   output_attentions=output_attentions,                                          │
│    600 │   │   )                                                                                 │
│    601 │   │   hidden_states = hidden_states + self.dropout(attention_output[0])                 │
│    602 │   │   outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:511 in forward      │
│                                                                                                  │
│    508 │   │   │   return hidden_states                                                          │
│    509 │   │                                                                                     │
│    510 │   │   # get query states                                                                │
│ ❱  511 │   │   query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length,  │
│    512 │   │                                                                                     │
│    513 │   │   # get key/value states                                                            │
│    514 │   │   key_states = project(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/tuners/lora.py:456 in forward                        │
│                                                                                                  │
│   453 │   │   │   │   nn.init.zeros_(self.lora_B.weight)                                         │
│   454 │   │                                                                                      │
│   455 │   │   def forward(self, x: torch.Tensor):                                                │
│ ❱ 456 │   │   │   result = super().forward(x)                                                    │
│   457 │   │   │   if self.r > 0:                                                                 │
│   458 │   │   │   │   result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling    │
│   459 │   │   │   return result                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:242 in forward                 │
│                                                                                                  │
│   239 │   │   if self.bias is not None and self.bias.dtype != x.dtype:                           │
│   240 │   │   │   self.bias.data = self.bias.data.to(x.dtype)                                    │
│   241 │   │                                                                                      │
│ ❱ 242 │   │   out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)                 │
│   243 │   │   if not self.state.has_fp16_weights:                                                │
│   244 │   │   │   if self.state.CB is not None and self.state.CxB is not None:                   │
│   245 │   │   │   │   # we converted 8-bit row major to turing/ampere format in the first infe   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:488 in matmul         │
│                                                                                                  │
│   485 │   state = state or MatmulLtState()                                                       │
│   486 │   if threshold > 0.0:                                                                    │
│   487 │   │   state.threshold = threshold                                                        │
│ ❱ 488 │   return MatMul8bitLt.apply(A, B, out, bias, state)                                      │
│   489                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:320 in forward        │
│                                                                                                  │
│   317 │   │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)          │
│   318 │   │   else:                                                                              │
│   319 │   │   │   if not state.has_fp16_weights and state.CxB is None and using_igemmlt:         │
│ ❱ 320 │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)              │
│   321 │   │   │   subA = None                                                                    │
│   322 │   │                                                                                      │
│   323 │   │   # 2. Quantize B                                                                    │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/functional.py:1698 in transform              │
│                                                                                                  │
│   1695                                                                                           │
│   1696                                                                                           │
│   1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=N  │
│ ❱ 1698 │   prev_device = pre_call(A.device)                                                      │
│   1699 │   if state is None: state = (A.shape, from_order)                                       │
│   1700 │   else: from_order = state[1]                                                           │
│   1701 │   if out is None: out, new_state = get_transform_buffer(state[0], A.dtype, A.device, t  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'device'
sujithjoseph commented 1 year ago

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

either on 1 GPU or with device_map "auto".

pacman100 commented 1 year ago

@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.

pacman100 commented 1 year ago

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

either on 1 GPU or with device_map "auto".

Does adding device_map={'':0} to PeftModel.from_pretrained resolve the issue: model = PeftModel.from_pretrained(model, peft_model_id, device_map={'':0})?

pacman100 commented 1 year ago

Also, may I know what is the input and output seq lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe below memory stats:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, works fine with a decent batch size. However, if input and output sequence lengths are very large, it might cause the OOM as activations from intermediate layers would become the bottleneck.

sujithjoseph commented 1 year ago

@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.

@pacman100 deepspeed==0.8.0

sujithjoseph commented 1 year ago

Also, may I know what is the input and output seq lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe below memory stats:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, works fine with a decent batch size. However, if input and output sequence lengths are very large, it might cause the OOM as activations from intermediate layers would become the bottleneck.

max length is 512 for both source and target.

sujithjoseph commented 1 year ago

Thanks a lot, @pacman100! This is awesome. I will reduce the max input sequence length. I am trying to see if I can pass a question and have Flan-T5 generate an answer / context summary.

sujithjoseph commented 1 year ago

Does it help if I increase gradient accumulation steps from 1 to 4? Will it help model accuracy, since I may be able to fit a larger effective batch size?
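
For reference, the effective batch size per optimizer step scales with gradient accumulation; a quick worked example under the 8-GPU, per-device-batch-size-1 setup described above:

per_device_batch_size = 1
num_gpus = 8
gradient_accumulation_steps = 4
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 32, vs. 8 with gradient_accumulation_steps=1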

pacman100 commented 1 year ago

One thing that I just checked was enabling gradient_checkpointing, which recomputes the activations of intermediate blocks instead of storing them. With that, using the same codebase as above, the memory consumed for input_seq_len=512 and output_seq_len=512 is 16GB per GPU. The changes to the code:

  model = AutoModelForSeq2SeqLM.from_pretrained(args.model_id)

+  if args.gradient_checkpointing:
+       model.gradient_checkpointing_enable()
+       def make_inputs_require_grad(module, input, output):
+           output.requires_grad_(True)

+       model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
+       model.config.use_cache=False

    # define LorA fine-tuning config
    if args.use_peft:
        peft_config = LoraConfig(
            task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
        )
        # Create PEFT model with LoraConfig
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()

So, you can now easily train with a batch size of 8 per GPU with inputs and outputs of length 512. However, there is no free lunch: the training time increases because of the need to recompute activations. Also, if you are evaluating using generate, evaluation will take a lot longer because model.config.use_cache=False is set, as caching is incompatible with gradient checkpointing.

However, to fit larger batches, I would first check what the 90th or 80th percentile lengths of my inputs and outputs are; for many use cases they can be a lot less than 512. If they are indeed 512, I would then use gradient_checkpointing.
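
A rough sketch of that percentile check (not from the thread; tokenizer, dataset, and the column names below are assumptions based on the training setup):

import numpy as np

source_lengths = [len(tokenizer(x).input_ids) for x in dataset["train"]["source"]]
target_lengths = [len(tokenizer(x).input_ids) for x in dataset["train"]["target"]]
for name, lengths in [("source", source_lengths), ("target", target_lengths)]:
    p80, p90 = np.percentile(lengths, [80, 90])
    print(f"{name}: p80={p80:.0f} tokens, p90={p90:.0f} tokens")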

sujithjoseph commented 1 year ago

Thanks @pacman100, this worked. Here are some stats for the average time over 100 inferences (in ms, using the timing snippet above):

FP32 without device_map didn't fit on a single 40 GB GPU with my current code; FP32 with device_map took 1982.5.
BF16 with device_map took 2144.4. Int-8 without device_map took 10719, and it didn't yield the same response as BF16 or FP32.

sujithjoseph commented 1 year ago

@pacman100, if we want to enable TF32 support instead of bf16, should I set --mixed_precision to 'no' and set

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

in the train function, since I am using A100s?

sujithjoseph commented 1 year ago

Thanks a bunch! With the above changes, I was able to squeeze in a batch size of 24 with tf32 precision and 32 with bf16 mixed-precision settings, at a source token max length of 62 and a target max length of 512.

sujithjoseph commented 1 year ago

I do get this warning frequently. I assume I can safely ignore it rather than reducing the batch size.

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
mayank31398 commented 1 year ago

You are at your memory limit on the GPU. This generally slows down training.

sujithjoseph commented 1 year ago

@pacman100, unfortunately it errored during the eval phase:

ValueError: cannot insert level_0, already exists

Generate config GenerationConfig {
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.27.0.dev0",
  "use_cache": false
}

 57% | 299/528 [5:54:26<4:29:36, 70.64s/it]
                                                                                                 │
│ /home/jupyter/t5/flant5/c/cdc_lora_train.py:324 in training_function                             │
│                                                                                                  │
│   321 │   │   │   │   )                                                                          │
│   322 │   │   │   │   if (step+1)%args.tracking_steps==0:                                        │
│   323 │   │   │   │   │   pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded   │
│ ❱ 324 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   "decoded_labels":decoded   │
│   325 │   │   │   │   │   accelerator.print(pred_df)                                             │
│   326 │   │   │   │                                                                              │
│   327 │   │   │   │   #break                                                                     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py:311 in wrapper                 │
│                                                                                                  │
│   308 │   │   │   │   │   FutureWarning,                                                         │
│   309 │   │   │   │   │   stacklevel=stacklevel,                                                 │
│   310 │   │   │   │   )                                                                          │
│ ❱ 311 │   │   │   return func(*args, **kwargs)                                                   │
│   312 │   │                                                                                      │
│   313 │   │   return wrapper                                                                     │
│   314                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:5799 in reset_index                  │
│                                                                                                  │
│    5796 │   │   │   │   │   │   level_values, lab, allow_fill=True, fill_value=lev._na_value     │
│    5797 │   │   │   │   │   )                                                                    │
│    5798 │   │   │   │                                                                            │
│ ❱  5799 │   │   │   │   new_obj.insert(0, name, level_values)                                    │
│    5800 │   │                                                                                    │
│    5801 │   │   new_obj.index = new_index                                                        │
│    5802 │   │   if not inplace:                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4414 in insert                       │
│                                                                                                  │
│    4411 │   │   │   )                                                                            │
│    4412 │   │   if not allow_duplicates and column in self.columns:                              │
│    4413 │   │   │   # Should this be a different kind of error??                                 │
│ ❱  4414 │   │   │   raise ValueError(f"cannot insert {column}, already exists")                  │
│    4415 │   │   if not isinstance(loc, int):                                                     │
│    4416 │   │   │   raise TypeError("loc must be int")                                           │
│    4417                                                 

ValueError: cannot insert level_0, already exists

Code snippet

                if (step+1)%args.tracking_steps==0:
                    pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded_preds,
                                                                "decoded_labels":decoded_labels})]).reset_index()
                    accelerator.print(pred_df)

I will try changing it to reset_index(drop=True) to see if that fixes it.
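
For what it's worth, the error comes from pandas: each reset_index() call without drop=True inserts the old index as a new column ("index", then "level_0"), and a further call fails because "level_0" already exists. A minimal reproduction, unrelated to PEFT:

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
df = pd.concat([df, df]).reset_index()  # inserts an "index" column
df = pd.concat([df, df]).reset_index()  # "index" exists, so inserts "level_0"
df = pd.concat([df, df]).reset_index()  # ValueError: cannot insert level_0, already exists
# reset_index(drop=True) discards the old index instead of inserting it as a column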

JohnGiorgi commented 1 year ago

Hello @sujithjoseph, for PEFT generate methods, one has to provide kwargs, could you try below change and let us know if that resolves the issue? Will add this point in caveats

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
-   input_ids, 
+   input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

This is already a long thread, so apologies for piling on but... this change makes it a bit harder to use PEFT with the transformers Seq2SeqTrainer as its call to model.generate does not pass the input_ids as a keyword arg:

generated_tokens = self.model.generate(
    generation_inputs,
    **gen_kwargs,
)

A simple fix would be to update this line of code

generated_tokens = self.model.generate(
-   generation_inputs,
+   input_ids=generation_inputs,
    **gen_kwargs,
)

I would be happy to PR this to Transformers if this is correct and there's no good reason for generation_inputs to be a positional argument.

smolskayanastassia commented 1 year ago

@pacman100 @sujithjoseph @JohnGiorgi @mayank31398 Could you please help with how to convert a PEFT model to ONNX using optimum?

sujithjoseph commented 1 year ago

@pacman100, I now run into a new inference issue, which I didn't see earlier, on a VM with just 1 A100 GPU (40 GB):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    default_data_collator,
    get_linear_schedule_with_warmup,
)
from peft import PeftModel, PeftConfig, PeftModelForSeq2SeqLM
#from peft import prepare_model_for_training
from peft import prepare_model_for_int8_training

torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
batch_size = 1
max_memory = {0: "39GIB", "cpu": "70GB"}

peft_model_id = "model-2-21"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16, max_memory=max_memory)
model = prepare_model_for_int8_training(model)

model = PeftModelForSeq2SeqLM.from_pretrained(model, peft_model_id, torch_dtype=torch.bfloat16, max_memory=max_memory)

model.to(torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Error :

RuntimeError: Attempting to deserialize object on CUDA device 6 but torch.cuda.device_count() is 1. Please use 
torch.load with map_location to map your storages to an existing device.

The error is the same even when I use device_map={'':0} or "auto".

It is also not working with device_map={'':0} and load_in_8bit set to True.

sujithjoseph commented 1 year ago

model.hf_device_map - this is what I see for the fine-tuned PEFT model. I am not sure of the reason behind the difference; this one was created from an interim checkpoint:

                if (step+1) % args.tracking_steps == 0:
                    pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded_preds,
                                                                "decoded_labels": decoded_labels})]).reset_index(drop=True)
                    accelerator.print(pred_df)
                    # checkpoint at every tracking step
                    accelerator.wait_for_everyone()
                    unwrapped_model = accelerator.unwrap_model(model)
                    unwrapped_model.save_pretrained(
                        args.output_dir,
                        is_main_process=accelerator.is_main_process,
                        save_function=accelerator.save,
                        state_dict=accelerator.get_state_dict(model),
                    )

@mayank31398  @pacman100 @younesbelkada 

{'shared': 0, 'decoder.embed_tokens': 0, 'encoder': 0, 'decoder.block.0': 0, 'decoder.block.1': 0, 'decoder.block.2': 0, 'decoder.block.3': 1, 'decoder.block.4': 1, 'decoder.block.5': 1, 'decoder.block.6': 1, 'decoder.block.7': 1, 'decoder.block.8': 1, 'decoder.block.9': 1, 'decoder.block.10': 1, 'decoder.block.11': 1, 'decoder.block.12': 1, 'decoder.block.13': 1, 'decoder.block.14': 1, 'decoder.block.15': 1, 'decoder.block.16': 1, 'decoder.block.17': 1, 'decoder.block.18': 1, 'decoder.block.19': 1, 'decoder.block.20': 1, 'decoder.block.21': 1, 'decoder.block.22': 1, 'decoder.block.23': 1, 'decoder.final_layer_norm': 1, 'decoder.dropout': 1, 'lm_head': 1}

Below is from https://huggingface.co/ybelkada/flan-t5-large-financial-phrasebank-lora . This loads up fine w/o any issues using the same code. 

model.hf_device_map

{'base_model.model.shared': 0, 'base_model.model.decoder.embed_tokens': 0, 'base_model.model.encoder': 0, 'base_model.model.decoder.block.0': 0, 'base_model.model.decoder.block.1': 0, 'base_model.model.decoder.block.2': 1, 'base_model.model.decoder.block.3': 1, 'base_model.model.decoder.block.4': 1, 'base_model.model.decoder.block.5': 1, 'base_model.model.decoder.block.6': 1, 'base_model.model.decoder.block.7': 1, 'base_model.model.decoder.block.8': 1, 'base_model.model.decoder.block.9': 1, 'base_model.model.decoder.block.10': 1, 'base_model.model.decoder.block.11': 1, 'base_model.model.decoder.block.12': 1, 'base_model.model.decoder.block.13': 1, 'base_model.model.decoder.block.14': 1, 'base_model.model.decoder.block.15': 1, 'base_model.model.decoder.block.16': 1, 'base_model.model.decoder.block.17': 1, 'base_model.model.decoder.block.18': 1, 'base_model.model.decoder.block.19': 1, 'base_model.model.decoder.block.20': 1, 'base_model.model.decoder.block.21': 1, 'base_model.model.decoder.block.22': 1, 'base_model.model.decoder.block.23': 1, 'base_model.model.decoder.final_layer_norm': 1, 'base_model.model.decoder.dropout': 1, 'base_model.model.lm_head': 1}

sujithjoseph commented 1 year ago

@pacman100 @sujithjoseph @JohnGiorgi @mayank31398 Could you please help how to convert Peft model to onnx using optimum?

I am also looking for the same info.

sujithjoseph commented 1 year ago

adapters_weights = torch.load(filename) in the PeftModel class - does it need a map_location passed to it as well?
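
As a workaround sketch (not an official PEFT API; depending on the PEFT version, set_peft_model_state_dict may live in peft or peft.utils, and the adapter file name is assumed), the adapter weights can be loaded manually with an explicit map_location and applied to the wrapped model:

import torch
from peft.utils import set_peft_model_state_dict

adapters_weights = torch.load("adapter_model.bin", map_location=torch.device("cpu"))  # file name assumed
set_peft_model_state_dict(model, adapters_weights)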

pacman100 commented 1 year ago

Hello, this thread has become too long to follow. Please raise separate issues for anything deviating from the original issue. @sujithjoseph, I think the original issue has been resolved; if so, could you please close this and open a new one for the recent issue you are facing?

ianbstewart commented 1 year ago

I'm facing a similar problem to @sujithjoseph's, where the config.json file is not saved during training, which makes it harder to load the model after training. Has there been a fix for this?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.