Gaiejj commented 2 months ago

System Info

transformers version: 4.43.1
Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
Python version: 3.11.9
Huggingface_hub version: 0.24.1
Safetensors version: 0.4.3
Accelerate version: 0.33.0
Accelerate config: not found
PyTorch version (GPU?): 2.3.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA H800

Who can help?

@ArthurZucker When training the llama2-7b-hf model with DeepSpeed ZeRO-3, I hit an error while resizing the token embeddings that I couldn't resolve. The llama2-7b-hf tokenizer has no pad_token, so I set a default one, which requires resizing the embedding. The same code runs correctly with transformers 4.41.2 but fails with 4.43.0.

I identified the following two anomalies:

Abnormal tensor shape

    params = [embeddings.weight]
    # embeddings.weight.size(0) is 32001 here
    context = (
        deepspeed.zero.GatheredParameters(params, modifier_rank=0)
        if is_deepspeed_zero3_enabled()
        else contextlib.nullcontext()
    )
    with context:
        for param in params:
            if param is None:
                continue
            assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
            # bug here, param size is 32000 while new_num_embeddings is 32001, in 4.43.0 transformers
            param_data = param.data
            param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
            param_data[-num_new_embeddings:] = param_mean
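
For readers unfamiliar with the snippet above, the intended initialization can be sketched without DeepSpeed or PyTorch: each newly added row is set to the column-wise mean of the pre-existing rows. The helper name `mean_init_new_rows` and the use of nested lists in place of tensors are assumptions made purely for illustration.

```python
def mean_init_new_rows(rows, num_new):
    """Overwrite the last `num_new` rows with the column-wise mean of the rest."""
    if num_new <= 0:
        return rows
    old_rows = rows[:-num_new]  # mirrors param_data[:-num_new_embeddings]
    if not old_rows:
        raise ValueError("need at least one pre-existing row to average")
    mean_row = [sum(col) / len(old_rows) for col in zip(*old_rows)]
    for i in range(len(rows) - num_new, len(rows)):
        rows[i] = list(mean_row)  # mirrors param_data[-num_new_embeddings:] = mean
    return rows


# Two existing rows plus one freshly added (zero) row.
table = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mean_init_new_rows(table, num_new=1)
print(table[-1])  # [2.0, 3.0]
```

The slicing only works as intended if the table already has the resized length, which is what the assertion in the snippet above checks.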

Abnormal ds_id
```
    params = [embeddings.weight]
    print(hasattr(embeddings.weight, 'ds_id'))
    # True for transformers 4.43.0, False for transformers 4.41.2
```
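
One way to make the two observations easier to compare across versions is to print both the local and the full shape of a parameter. The sketch below assumes that a ZeRO-3 partitioned parameter carrying `ds_id` also exposes its unpartitioned shape as `ds_shape`, and that its local shape is empty outside a gather context. The helper name `describe_param` and the `SimpleNamespace` stand-ins are hypothetical; they only mimic those attributes and are not real tensors.

```python
from types import SimpleNamespace


def describe_param(param):
    """Report whether a parameter looks ZeRO-3 partitioned, plus its shapes."""
    partitioned = hasattr(param, "ds_id")
    local_shape = tuple(param.shape)
    # Fall back to the local shape when no full shape is recorded.
    full_shape = tuple(getattr(param, "ds_shape", param.shape))
    return {
        "partitioned": partitioned,
        "local_shape": local_shape,
        "full_shape": full_shape,
    }


# Stand-ins mimicking the attributes discussed above (illustrative values only).
plain = SimpleNamespace(shape=(32001, 4096))
sharded = SimpleNamespace(shape=(0,), ds_id=7, ds_shape=(32000, 4096))

print(describe_param(plain))
print(describe_param(sharded))
```

Calling such a helper on `embeddings.weight` right after `resize_token_embeddings` would show in one line whether the weight is partitioned and which vocabulary size it reports.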
I've spent a lot of time pinpointing this issue, but I genuinely don't know how to resolve it. I sincerely hope you can provide assistance. This would be incredibly helpful, and I express my heartfelt gratitude to you.

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

1. The python file:


    import contextlib
    import json

    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.integrations.deepspeed import (
        HfDeepSpeedConfig,
        is_deepspeed_zero3_enabled,
    )

    DEFAULT_BOS_TOKEN: str = '~~'
    DEFAULT_EOS_TOKEN: str = '~~'
    DEFAULT_PAD_TOKEN: str = ''
    DEFAULT_UNK_TOKEN: str = ''

    model_name_or_path = 'PATHTO/Llama-2-7b-hf'
    ds_cfgs_path = 'PATH'

    deepspeed.init_distributed()

    with open(ds_cfgs_path) as f:
        ds_cfgs = json.load(f)
    ds_cfgs['bf16']['enabled'] = True

    # Must be created before the model and kept alive so that
    # from_pretrained() detects ZeRO-3.
    dstchf = HfDeepSpeedConfig(ds_cfgs)

    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        model_max_length=2048,
        padding_side='right',
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )


    # Based on train.py (see Reference).
    def resize_tokenizer_embedding(tokenizer, model) -> None:
        """Resize tokenizer and embedding.

        Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
        """

        def init_new_embeddings(
            embeddings,
            new_num_embeddings: int,
            num_new_embeddings: int,
        ) -> None:
            if embeddings is None:
                return

            params = [embeddings.weight]
            print(hasattr(embeddings.weight, 'ds_id'))
            # True for transformers 4.43.1, False for transformers 4.41.2
            exit()  # debugging stop; remove it to reach the assertion below
            context = (
                deepspeed.zero.GatheredParameters(params, modifier_rank=0)
                if is_deepspeed_zero3_enabled()
                else contextlib.nullcontext()
            )
            with context:
                for param in params:
                    if param is None:
                        continue
                    assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                    # bug here, param size is 32000 while new_num_embeddings is 32001
                    param_data = param.data
                    param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                    param_data[-num_new_embeddings:] = param_mean

        # Add only the special tokens the tokenizer is missing.
        special_tokens_dict = {}
        if tokenizer.pad_token is None:
            special_tokens_dict['pad_token'] = DEFAULT_PAD_TOKEN
        if tokenizer.eos_token is None:
            special_tokens_dict['eos_token'] = DEFAULT_EOS_TOKEN
        if tokenizer.bos_token is None:
            special_tokens_dict['bos_token'] = DEFAULT_BOS_TOKEN
        if tokenizer.unk_token is None:
            special_tokens_dict['unk_token'] = DEFAULT_UNK_TOKEN

        num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
        new_num_embeddings = len(tokenizer)

        model.config.bos_token_id = tokenizer.bos_token_id
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id

        if num_new_tokens > 0:
            hf_device_map = getattr(model, 'hf_device_map', {})
            devices = {
                torch.device(device)
                for device in hf_device_map.values()
                if device not in {'cpu', 'disk'}
            }
            is_model_parallel = len(devices) > 1

            if not is_model_parallel:
                model.resize_token_embeddings(new_num_embeddings)

                init_new_embeddings(
                    model.get_input_embeddings(),
                    new_num_embeddings=new_num_embeddings,
                    num_new_embeddings=num_new_tokens,
                )
                init_new_embeddings(
                    model.get_output_embeddings(),
                    new_num_embeddings=new_num_embeddings,
                    num_new_embeddings=num_new_tokens,
                )


    resize_tokenizer_embedding(tokenizer=tokenizer, model=model)
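
The docstring in the script notes that the resized vocabulary may not be divisible by 64. A minimal sketch of the rounding involved follows; `round_up_to_multiple` is a hypothetical helper and not part of the script. Recent transformers releases also accept a `pad_to_multiple_of` argument on `resize_token_embeddings` for a similar purpose, though whether it avoids the ZeRO-3 problem reported here is not established in this thread.

```python
def round_up_to_multiple(n: int, multiple: int = 64) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-n // multiple) * multiple  # ceiling division, then scale back up


print(round_up_to_multiple(32001))  # 32064
print(round_up_to_multiple(32000))  # 32000
```

With a padded vocabulary, the rows beyond `len(tokenizer)` are unused, so any mean initialization would need to account for the extra padding rows as well.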

2. The deepspeed start bash
```bash
deepspeed \
 --master_port 12345 \
 --module debug.py
```

3. The ds cfgs:


    {
      "train_batch_size": 128,
      "train_micro_batch_size_per_gpu": 16,
      "gradient_accumulation_steps": null,
      "steps_per_print": 10,
      "zero_optimization": {
        "stage": 3,
        "offload_param": {
          "device": "none"
        },
        "offload_optimizer": {
          "device": "none"
        },
        "param_persistence_threshold": 1e4,
        "max_live_parameters": 1e8,
        "prefetch_bucket_size": 3e7,
        "memory_efficient_linear": false,
        "gather_16bit_weights_on_model_save": true
      },
      "gradient_clipping": 1.0,
      "prescale_gradients": false,
      "wall_clock_breakdown": false,
      "hybrid_engine": {
        "enabled": false,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": false,
        "pin_parameters": true,
        "tp_gather_partition_size": 8
      },
      "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
      },
      "bf16": {
        "enabled": false
      }
    }
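
The config leaves `gradient_accumulation_steps` as null. DeepSpeed relates the three batch settings as train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so the missing value follows once the GPU count is known. The sketch below assumes 8 GPUs, which the report does not state; `implied_grad_accum_steps` is a hypothetical helper.

```python
def implied_grad_accum_steps(train_batch_size: int,
                             micro_batch_per_gpu: int,
                             world_size: int) -> int:
    """Derive gradient_accumulation_steps from the other two batch settings."""
    samples_per_micro_step = micro_batch_per_gpu * world_size
    if train_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "train_batch_size must be divisible by "
            "micro_batch_per_gpu * world_size"
        )
    return train_batch_size // samples_per_micro_step


# Values from the config above; world_size=8 is an assumption.
print(implied_grad_accum_steps(128, 16, world_size=8))  # 1
```

With 4 GPUs instead, the same config would imply 2 accumulation steps.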



Expected behavior

The embeddings should be resized correctly under ZeRO-3, as they are with transformers 4.41.2. Thanks!

ArthurZucker commented 2 months ago

Hey! I think #32192 should have fixed it!

seokhyunan commented 2 months ago

It seems the issue is still not fixed. You can check the progress in #32192.

Gaiejj commented 2 months ago

Thank you very much for your prompt response and continuous follow-up. I will closely monitor the latest updates. Thanks again for your hard work! ❤️

seokhyunan commented 2 months ago

This issue is resolved by #32214! Thanks to @zucchini-nlp.

ArthurZucker commented 2 months ago

On my way to do a patch then! Thanks all for reporting this quickly, and thanks @zucchini-nlp for your quick fixes!

Gaiejj commented 2 months ago

Congratulations❤️ ! We have successfully executed full-parameter PPO fine-tuning on Llama 3.1. Thanks again to @ArthurZucker @iamseokhyun and @zucchini-nlp for their super quick effort and follow-up!!!

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 1 month ago

Closing as completed!

huggingface / transformers

error occur in the resize_embedding #32196

Reference: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py