Hi @ratishsp. Thanks for reporting, I will take a look. Do you have (some) results from the previous checkpoints? Do they have better rouge scores and more meaningful outputs than checkpoint 1800?
Hi @ydshieh, thanks for looking into the issue. At an earlier checkpoint (1500), the model produced a good output for the above news article: </s><s>The Eiffel Tower is the tallest building in the world, with a height of 300 metres (1,063 ft).</s>
What is surprising is that the eval rouge fluctuates a lot till checkpoint 1500, after which it remains close to 0. I have attached below a tensorboard image of eval_rouge1
Even more surprising, the LED-Base model seems to be doing quite well!
Model output (checkpoint 1600):
</s><s>The Eiffel Tower in Paris is the tallest structure in the world.</s>
Actually, I checked the output of the base model... it was really quite good. It gets better if you increase max_length, e.g. to 64 or 128.
I had the same issue. allenai/led-base-16384 works well, but allenai/led-large-16384 and allenai/PRIMERA simply generate "" after a few hundred steps of training. I assume that it is an error in the generate method, since the training loss curves for the base and large models look really similar and both of them are reasonable.
Hi @ydshieh, checking if you were able to look into the issue.
Hi @ratishsp, I will look into this issue this week :-) hope I can have some insight!
Hi @ratishsp, I haven't run the script myself, but I see something already.
You mentioned you use examples/pytorch/summarization/run_summarization.py. That file is a general training script.
However, LEDModel/LEDForConditionalGeneration is somewhat special: it uses global_attention_mask.
As you are running summarization, the model is LEDForConditionalGeneration. For this model, we should put 1 in the global_attention_mask for the first token <s> of the encoder input sequence; for summarization, putting global attention on that first token is the advised setup. In fact, in your inference code snippet, you also have it:
global_attention_mask = torch.zeros_like(inputs)
global_attention_mask[:, 0] = 1
So (one of) the problem(s) must come from the fact that you don't include global_attention_mask in your training script. It should be fairly easy to add it (see the sketch below), but you can also check this notebook by my colleague @patrickvonplaten (I believe he is the author of this notebook).
Let me know if you get the desired results once you train with global attention!
(I am surprised the base model works fine however)
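For concreteness, a minimal sketch (not the exact run_summarization.py code) of how the mask could be built inside the dataset preprocessing, assuming model_inputs is the dict returned by the tokenizer:
model_inputs["global_attention_mask"] = [
    [1] + [0] * (len(input_ids) - 1)  # global attention only on the first token <s>
    for input_ids in model_inputs["input_ids"]
]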
Hi @ydshieh, I missed mentioning this in the original issue description: I had experimented with setting the global attention mask during training, but it didn't change the outcome.
Would you like to share your entire code (the version with global attention), so we can rule out differences between your code and mine? :-)
I had added the line model_inputs["global_attention_mask"] = [[1 if y == tokenizer.cls_token_id else 0 for y in x] for x in model_inputs["input_ids"]]
into the code after https://github.com/huggingface/transformers/blob/0d0aada56444ad554021947addaa035feb55948f/examples/pytorch/summarization/run_summarization.py#L536
Hi @ratishsp After a long investigation, although I do not fully understand the model behavior, here is the observation:
led-large (without further finetuning) will produce the same LM logits for [2, 0], i.e. the tokens [<eos>, <bos>] (or, say, [</s>, <s>]), no matter what the encoder input sequences are (at least for the xsum dataset), and therefore the same predicted token ids. I provide the script to confirm this below, and the results in the next 2 comments. The results for led-large are here.
During training, however, <eos> is required to predict the label <bos>, and <bos> is required to predict the first non-special token of the sentence. Since they have the same logits, this causes training difficulty, and the model ends up learning
<eos> --> <bos>
<bos> --> <bos>
(as both positions have the same predicted logits).
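To make the alignment concrete, here is a small sketch (assuming the allenai/led-large-16384 checkpoint and its tokenizer) showing how the labels are shifted right to build the decoder inputs, so that the </s> position must predict <s> and the <s> position must predict the first real token:
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

labels = torch.tensor([tok("A short summary.").input_ids])               # [<s>, ..., </s>]
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)  # [</s>, <s>, ...]

print(tok.convert_ids_to_tokens(labels[0].tolist()))
print(tok.convert_ids_to_tokens(decoder_input_ids[0].tolist()))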
There is one related discussion here. The suggested solution is to perturb the representation of the bos_token. I haven't tried it yet, but it makes sense to me.
However, why led-large (or, say, bart-large) has this issue is still mysterious to me!
To have more information printed:
git fetch https://github.com/ydshieh/transformers.git check_gen:check_gen
git checkout check_gen
Then run this script (inside examples/pytorch/summarization/):
import numpy as np
import torch
from transformers import AutoTokenizer
from transformers import LEDModel, LEDForConditionalGeneration
import datasets

summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
}

ckpt_led_base = "allenai/led-base-16384"
ckpt_led_large = "allenai/led-large-16384"

# The script loads the base checkpoint; swap in `ckpt_led_large` to get the led-large numbers shown below.
tokenizer = AutoTokenizer.from_pretrained(ckpt_led_base)
model = LEDForConditionalGeneration.from_pretrained(ckpt_led_base)


def get_dataset(dataset_name):

    max_source_length = 1024
    max_target_length = 128
    padding = True
    ignore_pad_token_for_loss = True
    padding = "max_length"
    prefix = ""

    max_train_samples = 1024
    max_eval_samples = 256
    preprocessing_num_workers = 8

    raw_datasets = datasets.load_dataset(dataset_name)
    text_column, summary_column = summarization_name_mapping[dataset_name]

    def foo(x):
        if x == tokenizer.cls_token_id:
            return 1
        elif x == tokenizer.pad_token_id:
            return -1
        else:
            return 0

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]

        if model.__class__.__name__.startswith("LED"):
            model_inputs["global_attention_mask"] = [[foo(y) for y in x] for x in model_inputs["input_ids"]]

        decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels=torch.tensor(model_inputs["labels"], dtype=torch.int32))
        decoder_input_ids = decoder_input_ids.numpy().tolist()
        model_inputs["decoder_input_ids"] = decoder_input_ids

        return model_inputs

    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["validation"]

    train_dataset = train_dataset.select(range(max_train_samples))
    eval_dataset = eval_dataset.select(range(max_eval_samples))

    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=['document', 'summary', 'id'],
        desc="Running tokenizer on train dataset",
    )
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=['document', 'summary', 'id'],
        desc="Running tokenizer on validation dataset",
    )

    return train_dataset, eval_dataset


train_dataset, eval_dataset = get_dataset("xsum")

for idx, eval_example in enumerate(eval_dataset):

    eval_example.pop("labels")
    decoder_input_ids = eval_example.pop("decoder_input_ids")
    eval_example["decoder_input_ids"] = [2, 0] + decoder_input_ids[2:5]

    for k in eval_example:
        eval_example[k] = torch.tensor([eval_example[k]], dtype=torch.int32)

    # `buffer` is populated with intermediate decoder hidden states by the `check_gen` debug branch fetched above
    model.led.decoder.buffer = {}
    output = model(**eval_example)

    print(f"example idx: {idx}")

    for k in model.led.decoder.buffer:
        h = model.led.decoder.buffer[k]
        if not isinstance(h, dict):
            pass
            # print(f'max diff in {k}: {np.amax(np.abs((h[0, 0] - h[0, 1]).detach().to("cpu").numpy()))}')
        else:
            layer_idx = k
            buffer = h
            for name in buffer:
                h = buffer[name]
                # print(f'layer {layer_idx} - {name}: max <eos> = {torch.max(torch.abs(h[0, 0]))}')
                # print(f'layer {layer_idx} - {name}: max <bos> = {torch.max(torch.abs(h[0, 1]))}')
                # print(f'layer {layer_idx} - {name}: max <eos> dim = {torch.argmax(torch.abs(h[0, 0]), dim=-1)}')
                # print(f'layer {layer_idx} - {name}: max <bos> dim = {torch.argmax(torch.abs(h[0, 1]), dim=-1)}')
                # top = torch.topk(torch.abs(h[0, 0]), k=8, dim=-1, largest=True, sorted=True)
                # print(f'layer {layer_idx} - {name}: top <eos> indices = {top.indices}')
                # print(f'layer {layer_idx} - {name}: top <eos> values = {top.values}')
                # print(f'layer {layer_idx} - {name}: var <eos> = {torch.var(h[0, 0], unbiased=False)}')
                # print(f'layer {layer_idx} - {name}: var <bos> = {torch.var(h[0, 1], unbiased=False)}')
                if "hidden_states: ffn: final_layer_norm" in name:
                    print(f'max diff in layer {layer_idx} - {name}: {np.amax(np.abs((h[0, 0] - h[0, 1]).detach().to("cpu").numpy()))}')

    print("-" * 20)
    print(f'max diff in lm logits: {np.amax(np.abs((output.logits[0, 0] - output.logits[0, 1]).detach().to("cpu").numpy()))}')
    print("-" * 20)

    pred = torch.argmax(output.logits, dim=-1).detach().to("cpu").numpy().tolist()
    print(f'predicted token ids: {pred}')

    print("=" * 40)

    if idx >= 10:
        break
For led-large: note that the quantity reported is the maximal absolute difference of the hidden states between the 0-th position (</s>) and the 1-st position (<s>); more precisely, np.amax(np.abs(h[0, 0] - h[0, 1])).
As you can see, no matter what the encoder input sequences are, the difference becomes really small as the layer depth increases.
example idx: 0
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.029722318053245544
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.0003014765679836273
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 9.097158908843994e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 2.812594175338745e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 4.470348358154297e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 1.7881393432617188e-07
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 2.384185791015625e-07
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 3.725290298461914e-09
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 1.4901161193847656e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 1.1920928955078125e-06
max diff in lm logits: 6.67572021484375e-06
predicted token ids: [[133, 133, 4913, 815, 19931]]
========================================
example idx: 1
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.02129286527633667
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.0002829432487487793
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.203089237213135e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 2.6635825634002686e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 4.470348358154297e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 2.384185791015625e-07
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 4.76837158203125e-07
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 3.814697265625e-06
max diff in lm logits: 1.0013580322265625e-05
predicted token ids: [[448, 448, 40741, 3463, 1034]]
========================================
example idx: 2
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.015403840690851212
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.000291973352432251
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 9.2238187789917e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 4.172325134277344e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 1.1920928955078125e-07
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 7.450580596923828e-09
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 3.725290298461914e-09
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 4.76837158203125e-06
max diff in lm logits: 1.1444091796875e-05
predicted token ids: [[0, 0, 385, 9, 6912]]
========================================
For led-base: note that the lm_logits have a significant difference, in the range [20, 30].
example idx: 0
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.92125129699707
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.954092502593994
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.275293350219727
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.49088191986084
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 4.469869613647461
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 29.27507972717285
max diff in lm logits: 26.215885162353516
predicted token ids: [[0, 133, 12, 815, 5142]]
========================================
example idx: 1
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.919170379638672
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.953605651855469
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.259047508239746
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.197162628173828
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 4.224005699157715
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 29.185691833496094
max diff in lm logits: 28.350433349609375
predicted token ids: [[0, 846, 40741, 3463, 3449]]
========================================
example idx: 2
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.921760559082031
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.953545570373535
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.30044937133789
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.065882682800293
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 3.919126510620117
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 28.759159088134766
max diff in lm logits: 26.200252532958984
predicted token ids: [[0, 35731, 385, 9, 6912]]
========================================
Hmm, that's very interesting. A couple of pointers that might help:
bart-large always forces the second token to be the BOS token during generation (see https://huggingface.co/facebook/bart-large/blob/main/config.json#L27), whereas led-large doesn't. However, led-large should probably do this as well, since led-large is based off bart-large.
led-large has exactly the same weights as bart-large. The only difference is that led-large has some additional randomly initialized layers for the global attention.
Also @ibeltagy - have you seen something like the above already by any chance?
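For reference, a hedged sketch of what mirroring that bart-large setting would look like at generation time (0 is the <s> token id for the BART/LED tokenizer; inputs and global_attention_mask built as in the inference snippet earlier in the thread):
# force <s> as the second generated token, as facebook/bart-large's config does
model.config.forced_bos_token_id = 0
summary_ids = model.generate(
    inputs,
    global_attention_mask=global_attention_mask,
    max_length=128,
)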
Also, one last comment: note that just because "</s> <s>" always predicts the same token regardless of the encoder outputs doesn't mean training is necessarily broken. During training all decoder_input_ids start with </s><s>, and then the model should learn the correct behavior; but it might indeed be a good idea to perturb the bos token.
In general, I wouldn't recommend using both </s> and <s> as prompt tokens for the decoder_input_ids, but that's how fairseq has done it with BART.
For the record: bart-large seems to have learned to predict the first token after <s> in the encoder input sequence, for both of the first two decoder tokens [</s>, <s>]. I provide a script to confirm this in [this comment](https://github.com/huggingface/transformers/issues/15559#issuecomment-1217894635).
For led-large-16384, it is the same situation. But when this is not the case, it gives [<s>, <s>]. This happens quite often, and I think it explains why we get [</s>, <s>, <s>, <s>, ...] after finetuning.
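A quick way to see this for yourself (a rough sketch, assuming the pretrained checkpoint and an arbitrary encoder input; global attention is omitted for brevity):
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

enc = tok("The Eiffel Tower is 300 metres tall.", return_tensors="pt")
decoder_input_ids = torch.tensor([[2, 0]])  # [</s>, <s>]
logits = model(**enc, decoder_input_ids=decoder_input_ids).logits

# tokens predicted at the </s> and <s> positions
print(tok.convert_ids_to_tokens(logits.argmax(dim=-1)[0].tolist()))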
@ratishsp
I could confirm that the trick of perturbing the bos token's embedding works for led-large-16384. Simply adding the following block after the line https://github.com/huggingface/transformers/blob/49e44b216b2559e34e945d5dcdbbe2238859e29b/examples/pytorch/summarization/run_summarization.py#L425 should work.
Please let us know if this works for you!
Here is the code to add:
Here is the code to add:
import torch
from transformers.modeling_utils import _load_state_dict_into_model
d = model.state_dict()
d["led.decoder.embed_tokens.weight"][0] = d["led.decoder.embed_tokens.weight"][0] + torch.randn(1024)
_load_state_dict_into_model(model, d, "led.")
Hi @ratishsp Hope the above solution works for you. I am going to close this issue, but if you have further questions, don't hesitate to reopen.
Hi @ydshieh, sorry for the late reply... I got busy with other stuff. I tried the above fix of perturbing the weights for bos, but it didn't work for me.
@ratishsp Sorry to hear that. I am not sure how much more I can help here, as the issue has been found and a fix provided that worked on my side (and for some other users previously).
If you can open a new working branch, add your fix there, and share it with us along with the list of training arguments used in your latest attempt, we could try to find some time to see if anything else is going wrong.
Hi @ydshieh I have followed an identical setup to the one mentioned at the beginning of the thread, but with the latest version of the Transformers repo. Sure, I can open a branch, add the fix and share it with you. Meanwhile, would it be possible for you to share the tensorboard log of your run, similar to the one here: https://github.com/huggingface/transformers/issues/18190#issuecomment-1189139463?
Hi @ratishsp. If you run it again on a branch that you plan to share with us, there are two fixes to take into account:
https://github.com/huggingface/transformers/issues/18190#issuecomment-1210958506
https://github.com/huggingface/transformers/issues/18190#issuecomment-1218408325
I also strongly suggest that you manually check whether the bos token embedding is actually changed before and after this (newly added) line:
_load_state_dict_into_model(model, d, "led.")
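Something along these lines (a rough sketch that wraps the suggested block with a before/after comparison; it assumes model is the already-loaded LEDForConditionalGeneration) could serve as that check:
import torch
from transformers.modeling_utils import _load_state_dict_into_model

# snapshot the <s> (token id 0) row of the decoder token embedding before the perturbation
before = model.led.decoder.embed_tokens.weight[0].detach().clone()

d = model.state_dict()
d["led.decoder.embed_tokens.weight"][0] = d["led.decoder.embed_tokens.weight"][0] + torch.randn(1024)
_load_state_dict_into_model(model, d, "led.")

after = model.led.decoder.embed_tokens.weight[0].detach()
print("bos embedding changed:", not torch.allclose(before, after))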
I didn't keep the training log - I tried the fix with training up to around 2K (or maybe 3K) steps, and didn't see this </s><s><s>... anymore (whereas without the fix, it did occur as you described).
Once you have the code (with the fixes mentioned above that you will add), we can see if there is some mistake. And if you still get </s><s><s>..., I will try to run it myself. (BTW, I won't be available next week.)
Hi @ydshieh, I have created a branch with the fixes at https://github.com/ratishsp/transformers-fix. I trained two models, LED-Base and LED-Large, with identical code. The training commands are the same as given earlier in the thread: https://github.com/huggingface/transformers/issues/18190#issue-1308379298. The tensorboard logs below show that the issue still exists.
Thanks @ratishsp . Will take a look once I am back!
Hi @ratishsp
As promised, I checked. You are right: perturbing the bos token embedding does not help for the checkpoint allenai/led-large-16384 (well, it helps a bit in the first few iterations, but as the steps continue, we get the same </s><s><s>).
I ran out of ideas; the only thing that works is to avoid using </s> <s> <tok_1> <tok_2> ... when preparing the labels. Instead, just use </s> <tok_1> <tok_2> .... To do so, add the following block after the line
https://github.com/huggingface/transformers/blob/4dd784c32f76fb8285f205b94e2a6ebde731a1cd/examples/pytorch/summarization/run_summarization.py#L536
# Originally, the `labels` are of the form: </s> <s> ..., which causes trouble for finetuning some checkpoints.
# Let's try to remove <s> (`bos` token) in `labels`, i.e. keep only the decoder_start_token (here </s>).
model_inputs["labels"] = [x[1:] for x in model_inputs["labels"]]
Or you can simply use my branch debug_led_large_bad_generation - it will save the generations after each evaluation.
You can verify the effect with (and without) this change by running a tiny training run (with very few examples) as below:
./run_summarization.py \
--model_name_or_path allenai/led-large-16384 \
--dataset_name xsum \
--output_dir ./led-large-16384-xsum-no-bos-dummy-1 \
--overwrite_output_dir \
--logging_dir ./led-large-16384-xsum-no-bos-dummy-logs-1 \
--do_train \
--do_eval \
--predict_with_generate \
--report_to tensorboard \
--load_best_model_at_end \
--greater_is_better True \
--metric_for_best_model rougeL \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=4 \
--evaluation_strategy steps \
--max_steps 500 \
--max_train_samples 500 \
--max_eval_samples 100 \
--logging_steps 100 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 10 \
--generation_max_length 128 \
--num_beams 3
Let me know if you can get normal results with this change 🙏 Thank you!
Hi @ydshieh, it works! Thanks.
@ratishsp I am super glad it also works for you 🤗 !
I will discuss with my colleagues where to put this information in our documentation, so there will be a clearer reference to this issue and the workaround.
Hi @ydshieh,
I'm facing the same problem but with another model; here is a link to the issue.
The finetuning works fine and the loss is decreasing as expected, but the model doesn't generate any sequences. Is there a way to modify the generation logic to get something out, without re-finetuning the model?
I answered in your post on the forum, but just in case:
Hi @customer101, could you provide a script (or the command you used to launch the training) that could reproduce the issue, please?
If you want to proceed on GitHub, it's better to open a new issue instead of continuing in this thread. Thank you.
System Info
transformers version: 4.20.0.dev0

Who can help?
@ydshieh

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
The logs show that at checkpoint 1800 the rouge becomes zero.
{'eval_loss': 2.172360897064209, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 20.0, 'eval_runtime': 10.2823, 'eval_samples_per_second': 9.725, 'eval_steps_per_second': 2.431, 'epoch': 0.04}
I evaluate the model output using the below function:
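(The function itself is not reproduced here; a minimal sketch of such an evaluation helper, reconstructed from the global_attention_mask lines quoted earlier in this thread and not the original code, might look like this:)
import torch

def summarize(article, model, tokenizer, max_length=128):
    inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024).input_ids
    global_attention_mask = torch.zeros_like(inputs)
    global_attention_mask[:, 0] = 1  # global attention on the first token <s>
    summary_ids = model.generate(
        inputs, global_attention_mask=global_attention_mask, max_length=max_length
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=False)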
It produces the output
</s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Expected behavior
The model should produce the summary of the news article.