huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
133.63k stars 26.7k forks

Error occurs in resize_embedding #32196

Closed Gaiejj closed 1 month ago

Gaiejj commented 2 months ago

Who can help?

@ArthurZucker When using DeepSpeed ZeRO-3 to train the llama2-7b-hf model, I encountered an error during the resize_embedding step that I couldn't resolve. The llama2-7b-hf tokenizer lacks a pad_token, so I set a default one, which requires resizing the embedding. The same code runs correctly with transformers 4.41.2 but fails with 4.43.0.

I identified the following two anomalies:

  1. Abnormal tensor shape
        params = [embeddings.weight]
        # embeddings.weight.size(0) is 32001 here
        context = (
            deepspeed.zero.GatheredParameters(params, modifier_rank=0)
            if is_deepspeed_zero3_enabled()
            else contextlib.nullcontext()
        )
        with context:
            for param in params:
                if param is None:
                    continue
                assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                # bug here, param size is 32000 while new_num_embeddings is 32001, in 4.43.0 transformers
                param_data = param.data
                param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                param_data[-num_new_embeddings:] = param_mean
  2. Abnormal ds_id
        params = [embeddings.weight]
        print(hasattr(embeddings.weight, 'ds_id'))
        # True for transformers 4.43.0, False for transformers 4.41.2

    I've spent a lot of time pinpointing this issue, but I genuinely don't know how to resolve it. I sincerely hope you can provide assistance. This would be incredibly helpful, and I express my heartfelt gratitude to you.
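For reference, here is a minimal sketch of what the initialization step in the first snippet is meant to do, written against plain nested lists instead of `embeddings.weight` so it runs without DeepSpeed or a model. The function name `mean_init_new_rows` and the toy values are illustrative assumptions, not part of transformers.

```python
def mean_init_new_rows(weight, new_num_embeddings, num_new_embeddings):
    """Overwrite the last `num_new_embeddings` rows with the mean of the rows before them.

    `weight` is a plain list of rows (a stand-in for `embeddings.weight`),
    modified in place and returned for convenience.
    """
    if num_new_embeddings <= 0:
        return weight
    # Same guard as the assert in the snippet: the matrix must already be resized.
    if len(weight) != new_num_embeddings:
        raise ValueError(f'{len(weight)}, {new_num_embeddings}')
    old_rows = weight[:-num_new_embeddings]
    if not old_rows:
        raise ValueError('no existing rows to average')
    dim = len(old_rows[0])
    # Column-wise mean over the pre-existing rows.
    mean_row = [sum(row[j] for row in old_rows) / len(old_rows) for j in range(dim)]
    for i in range(len(weight) - num_new_embeddings, len(weight)):
        weight[i] = list(mean_row)
    return weight


# Toy example: two "old" token rows plus one freshly added row.
resized = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mean_init_new_rows(resized, new_num_embeddings=3, num_new_embeddings=1)
print(resized[-1])  # [2.0, 3.0]
```

Without the size check, a matrix that still has only the old rows (32000 in my case) would have its last existing token row overwritten by the `[-num_new_embeddings:]` slice, which is why the assertion matters here.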

Reproduction

  1. The python file:

    import contextlib
    import json

    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.integrations.deepspeed import (
        HfDeepSpeedConfig,
        is_deepspeed_zero3_enabled,
    )

    DEFAULT_BOS_TOKEN: str = ''
    DEFAULT_EOS_TOKEN: str = ''
    DEFAULT_PAD_TOKEN: str = ''
    DEFAULT_UNK_TOKEN: str = ''

    model_name_or_path = 'PATHTO/Llama-2-7b-hf'
    ds_cfgs_path = 'PATH'

    deepspeed.init_distributed()

    with open(ds_cfgs_path) as f:
        ds_cfgs = json.load(f)
    ds_cfgs['bf16']['enabled'] = True

    # Keep this object alive so that from_pretrained picks up the ZeRO-3 config.
    dstchf = HfDeepSpeedConfig(ds_cfgs)

    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path,
        model_max_length=2048,
        padding_side='right',
        trust_remote_code=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )


    # Reference: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py
    def resize_tokenizer_embedding(tokenizer, model) -> None:
        """Resize tokenizer and embedding.

        Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
        """

        def init_new_embeddings(
            embeddings,
            new_num_embeddings: int,
            num_new_embeddings: int,
        ) -> None:
            if embeddings is None:
                return

            params = [embeddings.weight]
            print(hasattr(embeddings.weight, 'ds_id'))
            # True for transformers 4.43.1, False for transformers 4.41.2
            exit()  # debug exit to inspect ds_id; remove it to reach the assertion below
            context = (
                deepspeed.zero.GatheredParameters(params, modifier_rank=0)
                if is_deepspeed_zero3_enabled()
                else contextlib.nullcontext()
            )
            with context:
                for param in params:
                    if param is None:
                        continue
                    assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                    # bug here, param size is 32000 while new_num_embeddings is 32001
                    param_data = param.data
                    param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                    param_data[-num_new_embeddings:] = param_mean

        special_tokens_dict = {}
        if tokenizer.pad_token is None:
            special_tokens_dict['pad_token'] = DEFAULT_PAD_TOKEN
        if tokenizer.eos_token is None:
            special_tokens_dict['eos_token'] = DEFAULT_EOS_TOKEN
        if tokenizer.bos_token is None:
            special_tokens_dict['bos_token'] = DEFAULT_BOS_TOKEN
        if tokenizer.unk_token is None:
            special_tokens_dict['unk_token'] = DEFAULT_UNK_TOKEN

        num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
        new_num_embeddings = len(tokenizer)

        model.config.bos_token_id = tokenizer.bos_token_id
        model.config.eos_token_id = tokenizer.eos_token_id
        model.config.pad_token_id = tokenizer.pad_token_id

        if num_new_tokens > 0:
            hf_device_map = getattr(model, 'hf_device_map', {})
            devices = {
                torch.device(device)
                for device in hf_device_map.values()
                if device not in {'cpu', 'disk'}
            }
            is_model_parallel = len(devices) > 1

            if not is_model_parallel:
                model.resize_token_embeddings(new_num_embeddings)

                init_new_embeddings(
                    model.get_input_embeddings(),
                    new_num_embeddings=new_num_embeddings,
                    num_new_embeddings=num_new_tokens,
                )
                init_new_embeddings(
                    model.get_output_embeddings(),
                    new_num_embeddings=new_num_embeddings,
                    num_new_embeddings=num_new_tokens,
                )


    resize_tokenizer_embedding(tokenizer=tokenizer, model=model)

  2. The DeepSpeed launch command:

```bash
deepspeed \
 --master_port 12345 \
 --module debug.py
```

  3. The DeepSpeed config:

    {
    "train_batch_size": 128,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": null,
    "steps_per_print": 10,
    "zero_optimization": {
      "stage": 3,
      "offload_param": {
          "device": "none"
      },
      "offload_optimizer": {
          "device": "none"
      },
      "param_persistence_threshold": 1e4,
      "max_live_parameters": 1e8,
      "prefetch_bucket_size": 3e7,
      "memory_efficient_linear": false,
      "gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "wall_clock_breakdown": false,
    "hybrid_engine": {
      "enabled": false,
      "max_out_tokens": 512,
      "inference_tp_size": 1,
      "release_inference_cache": false,
      "pin_parameters": true,
      "tp_gather_partition_size": 8
    },
    "fp16": {
      "enabled": false,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "bf16": {
      "enabled": false
    }
    }
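
The config above leaves `gradient_accumulation_steps` as `null`. DeepSpeed documents the relation `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs`, so the missing value can be derived from the other two. The sketch below is only a sanity check under that relation; the helper name `implied_grad_accum_steps` and the `world_size=8` GPU count are hypothetical, and it assumes `null` is treated the same as an unset value.

```python
import json


def implied_grad_accum_steps(ds_cfgs, world_size):
    """Return the gradient_accumulation_steps implied by the two batch-size settings."""
    train_batch = ds_cfgs['train_batch_size']
    micro_batch = ds_cfgs['train_micro_batch_size_per_gpu']
    samples_per_step = micro_batch * world_size
    if train_batch % samples_per_step != 0:
        raise ValueError(
            f'train_batch_size={train_batch} is not a multiple of '
            f'micro_batch * world_size = {samples_per_step}'
        )
    return train_batch // samples_per_step


# Only the batch-size keys from the config; world_size=8 is a hypothetical GPU count.
ds_cfgs = json.loads(
    '{"train_batch_size": 128,'
    ' "train_micro_batch_size_per_gpu": 16,'
    ' "gradient_accumulation_steps": null}'
)
print(implied_grad_accum_steps(ds_cfgs, world_size=8))  # 1
```

With 4 GPUs the same settings would imply 2 accumulation steps, and a GPU count that does not divide evenly raises an error instead of silently rounding.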


### Expected behavior

The embedding should be resized correctly. Thanks!

ArthurZucker commented 2 months ago

Hey! I think #32192 should have fixed it!

seokhyunan commented 2 months ago

It seems the issue is still not fixed. You can check the progress in #32192.

Gaiejj commented 2 months ago

Thank you very much for your prompt response and continuous follow-up. I will closely monitor the latest updates. Thanks again for your hard work! ❤️

seokhyunan commented 2 months ago

This issue is resolved by #32214! Thanks to @zucchini-nlp.

ArthurZucker commented 2 months ago

On my way to do a patch then! Thanks all for reporting this quickly, and thanks @zucchini-nlp for your quick fixes!

Gaiejj commented 2 months ago

Congratulations❤️ ! We have successfully executed full-parameter PPO fine-tuning on Llama 3.1. Thanks again to @ArthurZucker @iamseokhyun and @zucchini-nlp for their super quick effort and follow-up!!!

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 1 month ago

Closing as completed!