huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLaMA FastTokenizer does not add `eos_token_id` at the end. #22794

Closed osainz59 closed 1 year ago

osainz59 commented 1 year ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens with the fast version (use_fast=True).

Steps to reproduce the behaviour:

  1. Load the LLaMA tokenizer
    tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=True)
  2. Tokenize something
    simple_sentence = "This is a sentence to test if the tokenizer adds eos token."
    simple_sentence_ids = tokenizer(
        simple_sentence, add_special_tokens=True
    ).input_ids
  3. Print the input_ids to check if the eos_token_id (2) is added at the end.
    print(simple_sentence_ids)
  4. Output:
    [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889]

Expected behavior

Expected output

[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 321, 359, 5993, 29889, 2]
ArthurZucker commented 1 year ago

Yes! Quick fix, use the slow tokenizer. Otherwise I'll open a PR to add template processing! Thanks for reporting!
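A minimal check of that workaround (a sketch, assuming the same LLAMA_PATH placeholder as in the report above):

from transformers import AutoTokenizer

# The slow (SentencePiece-based) tokenizer honors add_eos_token=True
tokenizer = AutoTokenizer.from_pretrained(LLAMA_PATH, add_eos_token=True, use_fast=False)

ids = tokenizer("This is a sentence to test if the tokenizer adds eos token.").input_ids
print(ids[-1] == tokenizer.eos_token_id)  # expected: True, with eos_token_id == 2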

gorjanradevski commented 1 year ago

But it shouldn't add an eos token right? The LM is not trained to generate a token after the eos I believe.

osainz59 commented 1 year ago

But it shouldn't add an eos token right? The LM is not trained to generate a token after the eos I believe.

By default it doesn't, but if add_eos_token=True is specified it should. You can always fine-tune the model so it learns when to stop.

Elfsong commented 1 year ago

I guess they would set the pad_token_id using the eos_token_id? model.config.pad_token_id = model.config.eos_token_id

ndvbd commented 1 year ago

Same here, doing add_eos_token=True doesn't do anything

ArthurZucker commented 1 year ago

This should have been fixed by #22959

jonathangomesselman commented 1 year ago

I guess they would set the pad_token_id using the eos_token_id? model.config.pad_token_id = model.config.eos_token_id

I believe that if you just set pad_token = eos_token, the model still does not learn to predict the eos_token, because the corresponding attn_mask does not include the token and the labels ignore it - i.e. no loss is computed for it. Not 100% sure about this, but that is what it seemed like from some self exploration.
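A small sketch that illustrates this (assuming the hf-internal-testing/llama-tokenizer checkpoint that also appears later in this thread; the masking logic is exactly the collator snippet quoted further down):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "hf-internal-testing/llama-tokenizer", add_eos_token=True, use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token  # the common (problematic) pattern

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator([tokenizer("a short example")])

# The last input token is the eos_token, but its label is -100, so it contributes no loss.
print(batch["input_ids"][0][-1].item())  # 2 (eos_token_id)
print(batch["labels"][0][-1].item())     # -100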

avacaondata commented 1 year ago

The same is happening with Falcon...

ArthurZucker commented 1 year ago

When you say the same, what do you mean?

avacaondata commented 1 year ago

That it doesn't generate <|endoftext|> (token id 11) when calling generate, therefore it never stops generating. I have tried by setting eos_token_id to 193, which corresponds to \n, but I don't think that's a clean fix. I have noticed that when tokenizing the inputs with the Falcon-40b tokenizer, it's not adding eos_token_id at the end of input ids.

ArthurZucker commented 1 year ago

A few things here. Llama has no official checkpoint, so make sure the one you are using is up to date and has the same eos token id in model.config, the generation config, and the tokenizer.

For Falcon, the code is on the Hub, but the latest transformers code adds the eos if you set "add_eos_token=True". In the Llama docs you can find that initializing a tokenizer with "add_eos_token=True" will make it add the eos when tokenizing.

avacaondata commented 1 year ago

Actually I was talking about Falcon, not Llama, because I'm facing an issue similar to the ones people are reporting with Llama. In fact I upgraded my transformers version to the latest version on the main branch, and the problem persists... The model never generates an EOS token, so it never stops generating... I have tried to explicitly add the string "<|endoftext|>" at the end of the inputs for fine-tuning, but it still doesn't work.

What can I do to make Falcon generate an EOS token?

ArthurZucker commented 1 year ago

The issue is different: the model not stopping does not mean that the tokenizer is not adding the eos_token, but rather that the model is not predicting it. The problem with LLaMA has already been mentioned here: #23230

avacaondata commented 1 year ago

I thought it could be related: my hypothesis was that Falcon wasn't generating the EOS token because it wasn't being included in the inputs when tokenizing, so when we train the model on inputs without an EOS token at the end, it doesn't learn to generate one.

jonathangomesselman commented 1 year ago

@avacaondata - I have noticed this same issue, where the model is not learning to predict the EOS token. After doing some digging through several examples and source code, I've noticed something a bit strange particularly related to the DataCollatorForLanguageModeling. A very typical pattern that I have seen suggested is the following:

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

However, the problem I see with this approach is that when the DataCollator overrides OR generates the labels field for the batch it sets all tokens == pad_token to be -100.

labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels

Since the CrossEntropy loss ignores tokens with -100, even if the tokenizer we are using properly adds the eos_token, the loss function will still ignore this token.

Ways that I have worked around this issue are either (1) to ensure that the eos_token_id != pad_token_id and make sure that the tokenizer includes the eos_token when tokenizing (some automatically do this such as the T5 tokenizer) OR (2) create the labels column myself when tokenizing - by cloning input_ids - and then using the DataCollatorForSeq2Seq. I actually really like the DataCollatorForSeq2Seq because it automatically pads the inputs and labels, but does not mess with tokens in unexpected ways, such as the eos_token.

Hope this is helpful!

avacaondata commented 1 year ago

@jonathangomesselman Thank you very much for the clear explanation, it makes much sense!

I will change the label for the eos token so that it's not ignored by cross entropy anymore.

Ideally I think that for instruction-tuning we shouldn't use DataCollatorForLanguageModeling, in this paper they did some experiments and found that only training over outputs typically works better: https://arxiv.org/pdf/2305.14314.pdf . However, I haven't found a way to make DataCollatorForSeq2Seq work for decoder-only models such as Llama or Falcon. Do you have any code on how to do that?

jonathangomesselman commented 1 year ago

@avacaondata - You're welcome!

I have generally followed this practice as well - just fine-tuning over the model outputs, since generally I don't need the model to directly learn the statistical distribution over human instructions, but rather just how to "react" to them.

Continuing from above, to use the DataCollatorForSeq2Seq for decoder-only models we need to manually create the labels field when tokenizing our data - i.e. ensuring we have the fields input_ids, attention_mask, and labels. Since we create the labels ourselves we have control over what tokens we explicitly train over vs. which we want to ignore (using -100 as a label). Here is the skeleton of some code you could use to tokenize the inputs:

from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
# By default the bos_token is added and not the eos_token. For instruction tuning I often ignore bos_token.
tokenizer.add_bos_token = False
tokenizer.add_eos_token = True

def create_instruction_tuned_format(data_row):
  return f"""<User Instruction>:{data_row["instruct"]}
<Agent Response>: {data_row['response']}
""".strip()

def tokenize(data_row):
  """Format and tokenize instruction tuning data

  1) Combine the user input (instruction) and agent response
  2) Create `labels` - ensuring we only fine tune over the 
  desired agent response
  """
  model_input_text = create_instruction_tuned_format(data_row)
  # Tokenize the full model input
  model_input = tokenizer(
        model_input_text, 
        truncation=True,
        padding=False,
        return_tensors=None
  )

  # Create `labels` - ignoring user input (instructions)
  agent_response = tokenizer(data_row['response']).input_ids
  num_tokens_ignore = len(model_input['input_ids']) - len(agent_response)
  ignored_tokens = [-100] * (num_tokens_ignore)
  # Copy over the ids for the desired agent response
  model_input['labels'] = ignored_tokens \
                            + model_input['input_ids'][-len(agent_response):]

  # Just to demonstrate length equality
  assert len(model_input['labels']) == len(model_input['input_ids'])

  return model_input

tokenized_ds = ds.map(tokenize, remove_columns=ds.column_names)

A couple of things to note/highlight:

  1. We combine the user instruction and agent response using a very simple format. In the LIMA paper for example they introduce a new EOT (end-of-turn) token to separate the instruction and the response.
  2. We tokenize the response to figure out the number of fine-tuning tokens at the end of the full token sequence.

Now that we have our data tokenized and formatted we can use the DataCollatorForSeq2Seq as follows:

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForSeq2Seq(
    tokenizer, return_tensors="pt", padding=True
)

batch_size = 8
train_dataloader = DataLoader(
    tokenized_ds, shuffle=True, collate_fn=data_collator, batch_size=batch_size, pin_memory=True
)

Note that the LLAMA tokenizer by default does not have a pad_token so we have to set it. Because we are using the DataCollatorForSeq2Seq it is okay for us to set the padding token to the eos_token as the collator does not create the labels tensor but rather just pads our existing labels tensor with -100 - i.e. the eos_token will not be ignored/replaced.

This may not be the most standard approach for doing this - but this is an example of what I have found to work / have seen some repos roughly follow. The main idea being that by creating the labels ourselves we are able to set -100 for tokens that we don't want to fine-tune over + ensure that we learn to generate the eos_token.
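A tiny demonstration of that padding behavior, reusing the data_collator defined above (a sketch; the token ids are made up for illustration):

examples = [
    {"input_ids": [1, 306, 763], "attention_mask": [1, 1, 1], "labels": [1, 306, 763]},
    {"input_ids": [1, 306], "attention_mask": [1, 1], "labels": [1, 306]},
]
batch = data_collator(examples)
# The shorter example's labels are padded with -100 (ignored by the loss), while its
# input_ids are padded with pad_token_id - no existing label gets overwritten.
print(batch["labels"])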

avacaondata commented 1 year ago

Wow @jonathangomesselman Thank you so much for such a clear explanation... :heart_eyes:

I tried it and yes it works flawlessly. I will check the LIMA paper in detail too to check for that EOT special token, I think that's an interesting approach.

Again, thank you so much, you were extremely helpful!! :heart:

jonathangomesselman commented 1 year ago

@avacaondata you're welcome! I had very similar questions to what you asked and found myself a bit surprised not to find many good resources. Thankfully the HuggingFace code repos are actually quite readable, especially in separating the complex model logic of the base pre-trained transformer models (encoder-decoder + decoder only) from the "language modeling" head (see sub-classes with ...ConditionalGeneration, ...CausalLM, ...LMHeadModel).

If you're curious yourself, I would definitely recommend looking at the code to learn more. Each model has a slightly different naming convention but you will see that the logic is nearly identical. Some to check out are:

Have fun exploring!

georgesung commented 1 year ago

@jonathangomesselman thanks a lot!

I was also running into this issue where the model was unable to output the eos_token after fine-tuning. I also followed examples where they set tokenizer.pad_token = tokenizer.eos_token. From your earlier comment, I made sure tokenizer.pad_token != tokenizer.eos_token by setting tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and using DataCollatorForLanguageModeling as before, e.g.

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Now the model finally outputs the eos_token as intended!

jonathangomesselman commented 1 year ago

@georgesung Thanks for sharing this approach! Adding a new [PAD] token is a great way to differentiate between that and the EOS token - which, as you say, allows you to then use the native DataCollatorForLanguageModeling. It is very interesting / odd to me that this is such a common problem, given it seems sort of obvious that we want this behavior. But regardless it is exciting to see the model finally start outputting the eos_token 😅. An interesting thing that I noticed is that this is generally not an issue with the Encoder-Decoder models such as T5. With these models the tokenizer generally adds the eos_token by default, and the collators used don't have this problem of ignoring the eos_token by treating it as a padding token.
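For instance, a quick check of that T5 behavior (a minimal sketch, assuming the t5-small checkpoint, which is not otherwise part of this thread):

from transformers import AutoTokenizer

t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
ids = t5_tokenizer("a short example").input_ids
print(ids[-1] == t5_tokenizer.eos_token_id)  # True - the T5 tokenizer appends </s> by default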

@avacaondata We can use a similar approach to add the EOT token described in the LIMA paper for separating the instruction and the response.

ArthurZucker commented 1 year ago

I think this could be a great TIP addition to the documentation / blog! If any of you has time to open a PR, feel free to do so and ping me! 🤗

jonathangomesselman commented 1 year ago

@ArthurZucker - I would be happy to work on this! Where do you think it would be best to add this TIP?

ArthurZucker commented 1 year ago

Probably in the llama.md!

brando90 commented 1 year ago

What is the correct code for Falcon? I'm still puzzled.

Related links:

brando90 commented 1 year ago

@georgesung question:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

But this assumes the model has a pad_token. I think an additional check has to be done that it also has an embedding for the pad_token, so that there are no runtime errors (~index errors when looking the token up in the embedding "table"/matrix).

But if one does that, some care might be needed to initialize the new token so that it doesn't dominate generation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
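A sketch of that initialization idea, following the linked post (assuming model and tokenizer are already loaded; the mean-of-existing-embeddings init is one common choice, not the only one):

import torch

num_new = tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # Initialize the new row(s) to the mean of the pre-existing embeddings so the
    # new token does not start with an outsized norm that skews generation.
    emb[-num_new:] = emb[:-num_new].mean(dim=0, keepdim=True)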

georgesung commented 1 year ago

@brando90

But this assumes the model has a pad_token

I haven't confirmed, but I think tokenizer.add_special_tokens({'pad_token': '[PAD]'}) is equivalent to tokenizer.pad_token = '[PAD]' (edit: might be wrong about that). So if there are runtime errors with tokenizer.add_special_tokens({'pad_token': '[PAD]'}) then there would also be runtime errors with tokenizer.pad_token = tokenizer.eos_token -- note tokenizer.eos_token is just a string. But I observed runtime errors with neither. I just observed that when I set tokenizer.pad_token = tokenizer.eos_token during training, the model won't stop generating during inference, since it was trained to not output the eos token (per discussions above).

Since I was working with open_llama_7b, I saw that even though the model's tokenizer didn't specify a pad token string in its tokenizer_config.json, it still had a row in its token embedding matrix for the pad token. If you run print(model), you can see its token embedding matrix has an index reserved for the pad token (idx 0 in this case):

> print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
..

You can also see the pad token's embedding itself: model.state_dict()['model.embed_tokens.weight'][0]. Although from discussions above and also this discussion, it doesn't seem to matter what the actual embeddings are for the pad token.

brando90 commented 1 year ago

@georgesung unfortunately I'm working with Falcon. It doesn't have a pad token to my surprise (I'm not sure how this even happens in the first place tbh):

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:10<00:00,  1.36s/it]
type(model)=<class 'transformers_modules.tiiuae.falcon-7b.2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5.modelling_RW.RWForCausalLM'>
type(tokenizer)=<class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
Using pad_token, but it is not set yet.
tokenizer.pad_token=None
type(peft_config)=<class 'peft.tuners.lora.LoraConfig'>
model=RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x DecoderLayer(
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)

---- start Print all special tokens
eos_token: <|endoftext|>
additional_special_tokens: ['>>TITLE<<', '>>ABSTRACT<<', '>>INTRODUCTION<<', '>>SUMMARY<<', '>>COMMENT<<', '>>ANSWER<<', '>>QUESTION<<', '>>DOMAIN<<', '>>PREFIX<<', '>>SUFFIX<<', '>>MIDDLE<<']

---- end Print all special tokens
model.get_input_embeddings().weight.size()=torch.Size([65024, 4544])
pad_embedding=tensor([[[-0.0179,  0.0201, -0.0273,  ..., -0.0275, -0.0396, -0.0131],
         [-0.0510, -0.0079, -0.0383,  ..., -0.0481,  0.0581,  0.0282],
         [-0.0217, -0.0216, -0.0064,  ..., -0.0508,  0.0554, -0.0013],
         ...,
         [ 0.0425,  0.0452, -0.0131,  ...,  0.0019,  0.0476,  0.0342],
         [-0.0170, -0.0085,  0.0449,  ..., -0.0074,  0.0178,  0.0043],
         [-0.0439, -0.0859, -0.0820,  ...,  0.0130,  0.0669,  0.0884]]],
       device='cuda:0', dtype=torch.float16, grad_fn=<UnsqueezeBackward0>)
Success!
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 190, in <module>
    example_test_model_already_has_pad_token()
  File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 182, in example_test_model_already_has_pad_token
    tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1271, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1144, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)

code:

    # qlora falcon7b
    from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
    model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
    print(f'{model=}')
    sent = 'Dogs are great because they are '
    print()

    # print to see if pad tokens are present and if it ignores the tokens at the end
    # encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
    # sys.exit()

    # Print all special tokens
    print('\n---- start Print all special tokens')
    for token_name, token in tokenizer.special_tokens_map.items():
        print(f"{token_name}: {token}")
    print('\n---- end Print all special tokens')

    # Get the ID for the '[PAD]' token
    try:
        pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
    except KeyError:
        raise ValueError("Token [PAD] is not present in the tokenizer vocabulary.")

    # Index into the model's embedding table
    try:
        print(f'{model.get_input_embeddings().weight.size()=}')
        pad_embedding = model.get_input_embeddings().weight[pad_token_id]
    except IndexError:
        raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")

    print(f'{pad_embedding=}')
    print('Success!')

    # check it generates something sensible
    tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
    print('Success2!')
brando90 commented 1 year ago

I think I just need to add it to the tokenizer and the model. Since the pad token is ignored during fine-tuning/training anyway, adding a randomly initialized row to the embedding matrix shouldn't matter - it won't be updated.

Code:

    # - Get falcon 4bit model
    # todo, where is this being saved & how to download quicker
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
    )
    # this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://stackoverflow.com/questions/76633335/why-does-hugging-face-falcon-model-use-mode-config-use-cache-false-why-wouldn
    model.config.use_cache = use_cache
    print(f'{type(model)=}')

    # - Get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                              trust_remote_code=True)  # execs code downloaded from hf hub
    # tokenizer.pad_token = tokenizer.eos_token  # todo: why? https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
    model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
    print(f'{type(tokenizer)=}')
    print(f'{tokenizer.pad_token=}')

So close....

Darn, this still doesn't work:

UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)

code:

""" sfttrainer (likely using peft) best practices: https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices

Best practices

Pay attention to the following best practices when training a model with that trainer:

todo: why trust_remote_code? I want more details. """ import sys

import torch from peft import LoraConfig

from transformers.modeling_utils import PreTrainedModel

from pdb import set_trace as st

def test_bfloat16_int4(compute_dtype: torch.dtype, use_4bit, ): """ python -c "import torch; print(torch.cuda.get_device_capability());" todo: check other code test_bfloat16() do we need use_4bit? """ if compute_dtype == torch.float16 and use4bit: major, = torch.cuda.get_device_capability() if major >= 8: print("=" 80) print("Your GPU supports bfloat16, you can accelerate training with the argument --bfloat16") print("=" 80)

def get_model_tokenizer_qlora_falcon7b(
    # -- model args
    # model_id = "tiiuae/falcon-7b"
    pretrained_model_name_or_path: str = "ybelkada/falcon-7b-sharded-bf16",
    use_cache: bool = True,
    # -- lora args
    lora_alpha=16,  # todo
    lora_dropout=0.1,  # todo, evidence drop out really help? google, crfm, gpt4
    lora_r=64,  # todo
    bnb_4bit_compute_dtype=torch.float16,  # changed it from Guanaco hf

    # -- training args
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    # paging so that the sudden mem gpu spikes don't cause the run to shut down
    # (I think usually caused by too long seqs)
    # todo: why 32 bit opt?
    # todo: paged nadamw opt?
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    max_steps=500,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    # -- quant. args (not recommended to be changed unless you know what your doing?)
    load_in_4bit=True,  # load (usually huge) base model in 4 bits
    bnb_4bit_quant_type="nf4",  # normal float 4 for the (large) base models qlora

) -> tuple: """ Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.

bf16 = 1S, 7Exp, 8Mantissa
hypothesis: 7b trained due to 6.7 emergence rumour, I still don't think emergence is real.
Notes:
    - ft a model is very specific to the model, tokenizer and training scheme. Thus we return
        - model, tokenizer, ft config (peft config), training args

ref:
    - https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoTokenizer

# - Get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,  # load (usually huge) base model in 4 bits
    bnb_4bit_quant_type=bnb_4bit_quant_type,  # normal float 4 for the (usually huge) base model
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,  # if you can, during computation use bf16
)

# - Get falcon 4bit model
# todo, where is this being saved & how to download quicker
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
)
print(f'{type(model)=}')
print(f'{model=}')
# this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://stackoverflow.com/questions/76633335/why-does-hugging-face-falcon-model-use-mode-config-use-cache-false-why-wouldn
model.config.use_cache = use_cache
print(f'{type(model)=}')

# - Get falcon tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                          trust_remote_code=True)  # execs code downloaded from hf hub
# tokenizer.pad_token = tokenizer.eos_token  # ref: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token
# tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # I think this is fine if during the training pad is ignored

# - Modify model
# add pad token embed
model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
model.config.max_new_tokens = len(tokenizer)
# model.config.min_length = 1
print(f'{model=}')
print(f'{type(tokenizer)=}')
print(f'{tokenizer.pad_token=}')
# data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) todo

# - Get falcon lora config
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    # model card for falcon tiiuae/falcon-7b: https://huggingface.co/tiiuae/falcon-7b/blob/main/modelling_RW.py
    # does seem to include all trainable params as done by qlora on their own paper
    target_modules=[
        # word_embeddings,
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
        # "lm_head"
    ]
)
print(f'{type(peft_config)=}')

# todo: print the num params of the lora = D1*r + D2*r and num of bytes by prec. (bytes) * num params
return model, tokenizer, peft_config

# -- tests

def example_test_model_already_has_pad_token():
    """ if it already has pad token, it likely has a small prob, so we are done.

    compare its norm with other tokens to verify this is true.

    python ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py
    """
    # - get the datasets todo: preprocessing, padding, streaming

from uutils.hf_uu.data_hf.common import get_guanaco_datsets_add_splits_train_test_only
trainset, _, testset = get_guanaco_datsets_add_splits_train_test_only()

# qlora falcon7b
from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
model: PreTrainedModel = model
print(f'{model=}')
sent = 'Dogs are great because they are '
print()

# print to see if pad tokens are present and if it ignores the tokens at the end
encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
print(f'{encoded_input=}')

# Print all special tokens
print('\n---- start Print all special tokens')
for token_name, token in tokenizer.special_tokens_map.items():
    print(f"{token_name}: {token}")
print('\n---- end Print all special tokens')

# Get the ID for the '[PAD]' token
try:
    pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
except KeyError:
    raise ValueError("Token [PAD] is not present in the tokenizer vocabulary.")

# Index into the model's embedding table
try:
    print(f'{model.get_input_embeddings().weight.size()=}')
    pad_embedding = model.get_input_embeddings().weight[pad_token_id]
except IndexError:
    raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")

print(f'{pad_embedding=}')
print('Success!\n')

# check it generates something sensible
# tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
input_ids, attention_mask = encoded_input['input_ids'], encoded_input['attention_mask']
predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
predicted_tokens_ids = predicted_tokens_ids_options[0]
predicted_sent = tokenizer.decode(predicted_tokens_ids)
print(f'original sentence: {sent=}')
print(f'predicted sentence: {predicted_sent=}')
print('Success2!')

if __name__ == '__main__':
    import time

start_time = time.time()
example_test_model_already_has_pad_token()
print(f"The main function executed in {time.time() - start_time} seconds.\a")
It doesn't like the modifications to the model:
model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
model.config.max_new_tokens = len(tokenizer)
ArthurZucker commented 1 year ago

Hey @brando90 ! Thanks a lot for reporting and using transformers. This particular thread is not exactly the right place for such huge chunks of code, or for discussing a separate issue. My best recommendation is:

Reading the post you created on the HF forum, you mention

it doesn't like the modifications to the model:

But since there is no traceback, this is very vague! A Colab will show the outputs you got, making it easier to understand. Also, regarding padding token vs. no padding token, I believe this is a very important question, and if we should review how we resize the embedding, so be it! Some models' embedding matrices are usually bigger than the length of the tokenizer, to allow adding new tokens / to be a power of X to make it faster.
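As an illustration of that last point, recent versions of transformers can round the resized embedding matrix up to a multiple directly (a sketch; the pad_to_multiple_of argument may not exist in older releases):

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Round the vocab dimension up to a multiple of 64, which tends to be friendlier for
# GPU kernels; the extra rows are simply never produced by the tokenizer.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)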

brando90 commented 1 year ago

https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token

robertheessels commented 1 year ago

As a temporary fix I was able to get inference (for a Falcon 7b fine-tune) to stop correctly like this:

This makes the model generate the token ***** (id 39735) at the end of the answer (because it is present in all the training examples), at which point generation stops because that token is set as the ending token.

    output_tokens = model.generate(
        input_ids = batch.input_ids, 
        max_new_tokens=100,
        temperature=0.001,
        top_p=0.7,
        num_return_sequences=1,
        pad_token_id=39735, # *****
        eos_token_id=39735, # *****
    )
robertheessels commented 1 year ago

Finally found the correct way to do this here: https://georgesung.github.io/ai/qlora-ift/

You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token

And you need to add the tokenizer.eos_token at the end of EACH training example.
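A minimal sketch of the second point, assuming tokenizer is already loaded and the training data is a datasets.Dataset with a "text" column (Arthur's add_eos_token=True suggestion further down achieves the same thing at the tokenizer level):

def append_eos(example):
    # Make sure every training example ends with the eos token string
    example["text"] = example["text"] + tokenizer.eos_token
    return example

dataset = dataset.map(append_eos)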

sadransh commented 1 year ago

In my case, for some reason, eos_token_id and ... were not being added to the model.generate configs.

ArthurZucker commented 1 year ago

If you want help feel free to open an issue with more details 😉

hgfriver commented 1 year ago

@robertheessels your answer solved my problem. You saved my life. Thank you so much!!!

You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token

And you need to add the tokenizer.eos_token at the end of EACH training example.

ArthurZucker commented 1 year ago

Adding the eos_token at the end of each training example can be activated using

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token = True)

Or simply:

>>> tokenizer.add_eos_token = True
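A quick sanity check (with a recent transformers release this should hold for the fast tokenizer as well):

>>> ids = tokenizer("This is a test").input_ids
>>> ids[-1] == tokenizer.eos_token_id
True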
dtthanh1971 commented 1 year ago

When I set up:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

my code crashes every time I run it on a T4 Colab:

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map 
)
# model.to(device)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# For LLaMA models, the default tokenizer does not specify a pad token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
#Adding the eos_token: </s> at the end of each training example
#tokenizer.add_eos_token = True

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

https://drive.google.com/file/d/1l6HsYH8iHgEA8H3jHMwgcMgA5_OdTbmh/view?usp=sharing

ArthurZucker commented 1 year ago

Hey, please open a different issue as this is not related! Also make sure that you can run nvidia-smi and that it shows the GPU; otherwise you might just not have set the Colab notebook to use a GPU instance.

mjyh commented 1 year ago

@dtthanh1971 Your issue may be because len(tokenizer) != model.vocab_size, i.e. len(tokenizer) == model.vocab_size + 1. That was my experience. See Kumar Saurabh's answer here: https://stackoverflow.com/questions/76633368/how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoi
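A small sketch of that check (assuming model and tokenizer are already loaded):

vocab_rows = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > vocab_rows:
    # e.g. after tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))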

LazerJesus commented 1 year ago

I am running into the same issue on stablecode-instruct-alpha-3b, which is GPT-NeoX based. What's the recommended approach in this case? Adding a [PAD] token results in issues during training: ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [94,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.

ArthurZucker commented 1 year ago

@LazerJesus you probably did not resize your embedding matrix. Run on CPU to get a better idea of where the issue arises, and open a separate issue if there is indeed a bug. Also, could you make sure to check other issues or the forum, as this is quite a common problem! (meaning it has been reported a lot)

alejandrofdzllorente commented 1 year ago

I solved it by changing AutoTokenizer to LlamaTokenizer to force the slow tokenizer instead of the fast tokenizer that is loaded automatically. I lost some functionality, but it works.

adi commented 1 year ago

I solved it by changing AutoTokenizer to LlamaTokenizer to force the slow tokenizer instead of the fast tokenizer that is loaded automatically. I lost some functionality, but it works.

This worked for me, too.

ArthurZucker commented 1 year ago
  1. You can force the usage of the slow tokenizer by setting use_fast = False when loading with AutoTokenizer
  2. The outputs (adding special tokens) should be the same. If they are not, then this is an issue for us. However, on main, if you use tokenizer("text", add_special_tokens=True) for both fast and slow you should get the same results. If not, and you are on main, feel free to open a new issue 😉
SuperBruceJia commented 11 months ago

@brando90

But this assumes the model has a pad_token

I haven't confirmed, but I think tokenizer.add_special_tokens({'pad_token': '[PAD]'}) is equivalent to tokenizer.pad_token = '[PAD]' (edit: might be wrong about that). So if there are runtime errors with tokenizer.add_special_tokens({'pad_token': '[PAD]'}) then there would also be runtime errors with tokenizer.pad_token = tokenizer.eos_token -- note tokenizer.eos_token is just a string. But I observed runtime errors with neither. I just observed that when I set tokenizer.pad_token = tokenizer.eos_token during training, the model won't stop generating during inference, since it was trained to not output the eos token (per discussions above).

Since I was working with open_llama_7b, I saw that even though the model's tokenizer didn't specify a pad token string in its tokenizer_config.json, it still had a row in its token embedding matrix for the pad token. If you run print(model), you can see its token embedding matrix has an index reserved for the pad token (idx 0 in this case):

> print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
..

You can also see the pad token's embedding itself: model.state_dict()['model.embed_tokens.weight'][0]. Although from discussions above and also this discussion, it doesn't seem to matter what the actual embeddings are for the pad token.

It works well when I use the following code:

DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"

tokenizer = AutoTokenizer.from_pretrained(
   "meta-llama/Llama-2-7b-hf",
    return_tensors="pt",
    model_max_length=512,
    add_eos_token=True,
    add_bos_token=True,
    padding='longest',
    padding_side="right",
    use_fast=False,
    trust_remote_code=True,
    # use_auth_token=use_auth_token,
    device_map="auto",
)

tokenizer.add_special_tokens(
        {
            "pad_token": DEFAULT_PAD_TOKEN,
            "eos_token": DEFAULT_EOS_TOKEN,
            "bos_token": DEFAULT_BOS_TOKEN,
            "unk_token": DEFAULT_UNK_TOKEN,
        }
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Best regards,

Shuyue Nov. 27th, 2023

SuperBruceJia commented 11 months ago

I have one question about adding these special tokens:

A warning occurred:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

After we add these special tokens, we resize input token embeddings matrix of the model:

# Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))

So, one problem here is: how can we know whether these newly added embeddings have been fine-tuned?

Thank you very much!

Best regards,

Shuyue Nov. 27th, 2023

ArthurZucker commented 11 months ago

Of course they are not, if the matrix was resized / these tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before and thus were never seen during training.
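If you want to check empirically, one option is to snapshot the new rows before training and compare them afterwards (a sketch, assuming num_new tokens were just added and the embeddings resized):

import torch

emb = model.get_input_embeddings().weight
before = emb[-num_new:].detach().clone()

# ... run training ...

after = model.get_input_embeddings().weight[-num_new:].detach()
print(torch.allclose(before, after))  # False if the new rows actually received gradient updates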

SuperBruceJia commented 11 months ago

Of course they are not, if the matrix was resized / these tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before and thus were never seen during training.

Dear Arthur,

Thank you very much for your reply!

I also have one question about training the model. In practice, what sequence length should we set when we train the model? Should we use the maximum length of the dataset, or the average length? Should we use group_by_length? Or should we pad all sequences to the same length (for example, the maximum length in a batch), or should we just pad every sequence in the dataset to the maximum length? Which is the most effective in practice?

Thank you very much, and have a nice day!

Best regards,

Shuyue Dec. 2nd, 2023

ArthurZucker commented 11 months ago

Not sure at all about this, but you can ask on the forum instead!

Thanks!