Hi @ratishsp. Thanks for reporting, I will take a look. Do you have (some) results from the previous checkpoints? Do they have better rouge scores and more meaningful outputs than checkpoint 1800?
Hi @ydshieh, thanks for looking into the issue. At an earlier checkpoint (1500), the model produced a good output for the above news article: </s><s>The Eiffel Tower is the tallest building in the world, with a height of 300 metres (1,063 ft).</s>
What is surprising is that the eval rouge fluctuates a lot till checkpoint 1500, after which it remains close to 0. I have attached below a tensorboard image of eval_rouge1
Even more surprising, the LED-Base model seems to be doing quite well!
Model output (checkpoint 1600):
</s><s>The Eiffel Tower in Paris is the tallest structure in the world.</s>
Actually, I checked the output of the base model... it was really quite good. It gets better if you increase max_length, e.g. to 64 or 128.
I had the same issue. allenai/led-base-16384 works well, but allenai/led-large-16384 and allenai/PRIMERA simply generate "" after a few hundred steps of training. I assume that it is an error in the generate method, since the training loss curves for the base and large models look really similar and both of them are reasonable.
Hi @ydshieh, checking if you were able to look into the issue.
Hi @ratishsp, I will look into this issue this week :-) hope I can have some insight!
Hi @ratishsp, I haven't run the script myself, but I see something already.
You mentioned you use examples/pytorch/summarization/run_summarization.py. That file is a general training script.
However, LEDModel/LEDForConditionalGeneration is somewhat special: it uses global_attention_mask.
As you are running summarization, the model is LEDForConditionalGeneration. For this model, we should put 1 in the global_attention_mask for the first token <s> of the encoder input sequence; for summarization, putting global attention on that first token is the advised setup. In fact, in your inference code snippet, you also have it:
global_attention_mask = torch.zeros_like(inputs)
global_attention_mask[:, 0] = 1
So (one of) the problem(s) must come from the fact that you don't include global_attention_mask in your training script. It should be fairly easy to add it (see the sketch below), but you can also check this notebook by my colleague @patrickvonplaten (I believe he is the author of this notebook).
Let me know if you get the desired results once you train with global attention!
(I am surprised the base model works fine however)
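For concreteness, a minimal sketch (not the exact run_summarization.py code) of how the mask could be built inside the dataset preprocessing, assuming model_inputs is the dict returned by the tokenizer:
model_inputs["global_attention_mask"] = [
    [1] + [0] * (len(input_ids) - 1)  # global attention only on the first token <s>
    for input_ids in model_inputs["input_ids"]
]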
Hi @ydshieh, I missed mentioning this in the original issue description: I had experimented with setting the global attention mask during training, but it didn't change the outcome.
Would you like to share your entire code (the version with global attention), so we can rule out differences between your code and mine? :-)
I had added the line model_inputs["global_attention_mask"] = [[1 if y == tokenizer.cls_token_id else 0 for y in x] for x in model_inputs["input_ids"]]
into the code after https://github.com/huggingface/transformers/blob/0d0aada56444ad554021947addaa035feb55948f/examples/pytorch/summarization/run_summarization.py#L536
Hi @ratishsp After a long investigation, although I do not fully understand the model behavior, here is the observation:
led-large (without further finetuning) will produce the same LM logits for [2, 0], i.e. the tokens [<eos>, <bos>] (or, say, [</s>, <s>]), no matter what the encoder input sequences are (at least for the xsum dataset), and therefore the same predicted token ids. I provide the script to confirm this below, and the results in the next 2 comments. The results for led-large are here.
During training, however, <eos> is required to predict the label <bos>, and <bos> is required to predict the first non-special token of the sentence. Since they have the same logits, this causes training difficulty, and the model ends up learning
<eos> --> <bos>
<bos> --> <bos>
(as both positions have the same predicted logits).
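To make the alignment concrete, here is a small sketch (assuming the allenai/led-large-16384 checkpoint and its tokenizer) showing how the labels are shifted right to build the decoder inputs, so that the </s> position must predict <s> and the <s> position must predict the first real token:
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

labels = torch.tensor([tok("A short summary.").input_ids])               # [<s>, ..., </s>]
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)  # [</s>, <s>, ...]

print(tok.convert_ids_to_tokens(labels[0].tolist()))
print(tok.convert_ids_to_tokens(decoder_input_ids[0].tolist()))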
There is one related discussion here. The suggested solution is to perturb the representation of the bos_token. I haven't tried it yet, but it makes sense to me.
However, why led-large (or, say, bart-large) has this issue is still mysterious to me!
To have more information printed:
git fetch https://github.com/ydshieh/transformers.git check_gen:check_gen
git checkout check_gen
Then run this script (inside examples/pytorch/summarization/):
import numpy as np
import torch
from transformers import AutoTokenizer
from transformers import LEDModel, LEDForConditionalGeneration
import datasets

summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
}

ckpt_led_base = "allenai/led-base-16384"
ckpt_led_large = "allenai/led-large-16384"

# The script loads the base checkpoint; swap in `ckpt_led_large` to get the led-large numbers shown below.
tokenizer = AutoTokenizer.from_pretrained(ckpt_led_base)
model = LEDForConditionalGeneration.from_pretrained(ckpt_led_base)


def get_dataset(dataset_name):

    max_source_length = 1024
    max_target_length = 128
    padding = True
    ignore_pad_token_for_loss = True
    padding = "max_length"
    prefix = ""

    max_train_samples = 1024
    max_eval_samples = 256
    preprocessing_num_workers = 8

    raw_datasets = datasets.load_dataset(dataset_name)
    text_column, summary_column = summarization_name_mapping[dataset_name]

    def foo(x):
        if x == tokenizer.cls_token_id:
            return 1
        elif x == tokenizer.pad_token_id:
            return -1
        else:
            return 0

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]

        if model.__class__.__name__.startswith("LED"):
            model_inputs["global_attention_mask"] = [[foo(y) for y in x] for x in model_inputs["input_ids"]]

        decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels=torch.tensor(model_inputs["labels"], dtype=torch.int32))
        decoder_input_ids = decoder_input_ids.numpy().tolist()
        model_inputs["decoder_input_ids"] = decoder_input_ids

        return model_inputs

    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["validation"]

    train_dataset = train_dataset.select(range(max_train_samples))
    eval_dataset = eval_dataset.select(range(max_eval_samples))

    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=['document', 'summary', 'id'],
        desc="Running tokenizer on train dataset",
    )
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=['document', 'summary', 'id'],
        desc="Running tokenizer on validation dataset",
    )

    return train_dataset, eval_dataset


train_dataset, eval_dataset = get_dataset("xsum")

for idx, eval_example in enumerate(eval_dataset):

    eval_example.pop("labels")
    decoder_input_ids = eval_example.pop("decoder_input_ids")
    eval_example["decoder_input_ids"] = [2, 0] + decoder_input_ids[2:5]

    for k in eval_example:
        eval_example[k] = torch.tensor([eval_example[k]], dtype=torch.int32)

    # `buffer` is populated with intermediate decoder hidden states by the `check_gen` debug branch fetched above
    model.led.decoder.buffer = {}
    output = model(**eval_example)

    print(f"example idx: {idx}")

    for k in model.led.decoder.buffer:
        h = model.led.decoder.buffer[k]
        if not isinstance(h, dict):
            pass
            # print(f'max diff in {k}: {np.amax(np.abs((h[0, 0] - h[0, 1]).detach().to("cpu").numpy()))}')
        else:
            layer_idx = k
            buffer = h
            for name in buffer:
                h = buffer[name]
                # print(f'layer {layer_idx} - {name}: max <eos> = {torch.max(torch.abs(h[0, 0]))}')
                # print(f'layer {layer_idx} - {name}: max <bos> = {torch.max(torch.abs(h[0, 1]))}')
                # print(f'layer {layer_idx} - {name}: max <eos> dim = {torch.argmax(torch.abs(h[0, 0]), dim=-1)}')
                # print(f'layer {layer_idx} - {name}: max <bos> dim = {torch.argmax(torch.abs(h[0, 1]), dim=-1)}')
                # top = torch.topk(torch.abs(h[0, 0]), k=8, dim=-1, largest=True, sorted=True)
                # print(f'layer {layer_idx} - {name}: top <eos> indices = {top.indices}')
                # print(f'layer {layer_idx} - {name}: top <eos> values = {top.values}')
                # print(f'layer {layer_idx} - {name}: var <eos> = {torch.var(h[0, 0], unbiased=False)}')
                # print(f'layer {layer_idx} - {name}: var <bos> = {torch.var(h[0, 1], unbiased=False)}')
                if "hidden_states: ffn: final_layer_norm" in name:
                    print(f'max diff in layer {layer_idx} - {name}: {np.amax(np.abs((h[0, 0] - h[0, 1]).detach().to("cpu").numpy()))}')

    print("-" * 20)
    print(f'max diff in lm logits: {np.amax(np.abs((output.logits[0, 0] - output.logits[0, 1]).detach().to("cpu").numpy()))}')
    print("-" * 20)

    pred = torch.argmax(output.logits, dim=-1).detach().to("cpu").numpy().tolist()
    print(f'predicted token ids: {pred}')

    print("=" * 40)

    if idx >= 10:
        break
For led-large: note that the quantity reported is the maximal absolute difference of the hidden states between the 0-th position (</s>) and the 1-st position (<s>); more precisely, np.amax(np.abs(h[0, 0] - h[0, 1])).
As you can see, no matter what the encoder input sequences are, the difference becomes really small as the layer depth increases.
example idx: 0
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.029722318053245544
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.0003014765679836273
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 9.097158908843994e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 2.812594175338745e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 4.470348358154297e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 1.7881393432617188e-07
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 2.384185791015625e-07
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 3.725290298461914e-09
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 1.4901161193847656e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 1.1920928955078125e-06
max diff in lm logits: 6.67572021484375e-06
predicted token ids: [[133, 133, 4913, 815, 19931]]
========================================
example idx: 1
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.02129286527633667
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.0002829432487487793
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.203089237213135e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 2.6635825634002686e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 4.470348358154297e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 2.384185791015625e-07
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 4.76837158203125e-07
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 3.814697265625e-06
max diff in lm logits: 1.0013580322265625e-05
predicted token ids: [[448, 448, 40741, 3463, 1034]]
========================================
example idx: 2
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 0.015403840690851212
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 0.000291973352432251
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 9.2238187789917e-06
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 4.172325134277344e-07
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 2.9802322387695312e-08
max diff in layer 6 - hidden_states: ffn: final_layer_norm: 1.1920928955078125e-07
max diff in layer 7 - hidden_states: ffn: final_layer_norm: 7.450580596923828e-09
max diff in layer 8 - hidden_states: ffn: final_layer_norm: 3.725290298461914e-09
max diff in layer 9 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 10 - hidden_states: ffn: final_layer_norm: 5.960464477539063e-08
max diff in layer 11 - hidden_states: ffn: final_layer_norm: 4.76837158203125e-06
max diff in lm logits: 1.1444091796875e-05
predicted token ids: [[0, 0, 385, 9, 6912]]
========================================
For led-base: note that the lm_logits have a significant difference, in the range [20, 30].
example idx: 0
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.92125129699707
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.954092502593994
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.275293350219727
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.49088191986084
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 4.469869613647461
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 29.27507972717285
max diff in lm logits: 26.215885162353516
predicted token ids: [[0, 133, 12, 815, 5142]]
========================================
example idx: 1
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.919170379638672
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.953605651855469
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.259047508239746
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.197162628173828
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 4.224005699157715
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 29.185691833496094
max diff in lm logits: 28.350433349609375
predicted token ids: [[0, 846, 40741, 3463, 3449]]
========================================
example idx: 2
max diff in layer 0 - hidden_states: ffn: final_layer_norm: 9.921760559082031
max diff in layer 1 - hidden_states: ffn: final_layer_norm: 6.953545570373535
max diff in layer 2 - hidden_states: ffn: final_layer_norm: 8.30044937133789
max diff in layer 3 - hidden_states: ffn: final_layer_norm: 13.065882682800293
max diff in layer 4 - hidden_states: ffn: final_layer_norm: 3.919126510620117
max diff in layer 5 - hidden_states: ffn: final_layer_norm: 28.759159088134766
max diff in lm logits: 26.200252532958984
predicted token ids: [[0, 35731, 385, 9, 6912]]
========================================
Hmm, that's very interesting. A couple of pointers that might help:
bart-large always forces the second token to be the BOS token during generation (see https://huggingface.co/facebook/bart-large/blob/main/config.json#L27), whereas led-large doesn't. However, led-large should probably do this as well, since led-large is based off bart-large.
led-large has exactly the same weights as bart-large. The only difference is that led-large has some additional randomly initialized layers for the global attention.
Also @ibeltagy - have you seen something like the above already by any chance?
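For reference, a hedged sketch of what mirroring that bart-large setting would look like at generation time (0 is the <s> token id for the BART/LED tokenizer; inputs and global_attention_mask built as in the inference snippet earlier in the thread):
# force <s> as the second generated token, as facebook/bart-large's config does
model.config.forced_bos_token_id = 0
summary_ids = model.generate(
    inputs,
    global_attention_mask=global_attention_mask,
    max_length=128,
)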
Also, one last comment: note that just because "</s> <s>" always predicts the same token regardless of the encoder outputs doesn't mean training is necessarily broken. During training all decoder_input_ids start with </s><s>, and then the model should learn the correct behavior; but it might indeed be a good idea to perturb the bos token.
In general, I wouldn't recommend using both </s> and <s> as prompt tokens for the decoder_input_ids, but that's how fairseq has done it with BART.
For the record: bart-large seems to have learned to predict the first token after <s> in the encoder input sequence, for both of the first two decoder tokens [</s>, <s>]. I provide a script to confirm this in [this comment](https://github.com/huggingface/transformers/issues/15559#issuecomment-1217894635).
For led-large-16384, it is the same situation. But when this is not the case, it gives [<s>, <s>]. This happens quite often, and I think it explains why we get [</s>, <s>, <s>, <s>, ...] after finetuning.
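A quick way to see this for yourself (a rough sketch, assuming the pretrained checkpoint and an arbitrary encoder input; global attention is omitted for brevity):
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

tok = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

enc = tok("The Eiffel Tower is 300 metres tall.", return_tensors="pt")
decoder_input_ids = torch.tensor([[2, 0]])  # [</s>, <s>]
logits = model(**enc, decoder_input_ids=decoder_input_ids).logits

# tokens predicted at the </s> and <s> positions
print(tok.convert_ids_to_tokens(logits.argmax(dim=-1)[0].tolist()))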
@ratishsp
I could confirm that the trick of perturbing the bos token's embedding works for led-large-16384. Simply adding the following block after the line https://github.com/huggingface/transformers/blob/49e44b216b2559e34e945d5dcdbbe2238859e29b/examples/pytorch/summarization/run_summarization.py#L425 should work.
Please let us know if this works for you!
Here is the code to add:
Here is the code to add:
import torch
from transformers.modeling_utils import _load_state_dict_into_model
d = model.state_dict()
d["led.decoder.embed_tokens.weight"][0] = d["led.decoder.embed_tokens.weight"][0] + torch.randn(1024)
_load_state_dict_into_model(model, d, "led.")
Hi @ratishsp Hope the above solution works for you. I am going to close this issue, but if you have further questions, don't hesitate to reopen.
Hi @ydshieh, sorry for the late reply... I got busy with other stuff. I tried the above fix of perturbing the weights for bos, but it didn't work for me.
@ratishsp Sorry to hear that. I am not sure how much more I can help here, as the issue has been found and a fix provided that worked on my side (and for some other users previously).
If you can open a new working branch, add your fix there, and share it with us along with the list of training arguments used in your latest attempt, we could try to find some time to see if anything else is going wrong.
Hi @ydshieh I have followed an identical setup to the one mentioned at the beginning of the thread, but with the latest version of the Transformers repo. Sure, I can open a branch, add the fix and share it with you. Meanwhile, would it be possible for you to share the tensorboard log of your run, similar to the one here: https://github.com/huggingface/transformers/issues/18190#issuecomment-1189139463?
Hi @ratishsp. If you run it again on a branch that you plan to share with us, there are two fixes to take into account:
https://github.com/huggingface/transformers/issues/18190#issuecomment-1210958506
https://github.com/huggingface/transformers/issues/18190#issuecomment-1218408325
I also strongly suggest that you manually check whether the bos token embedding is actually changed before and after this (newly added) line:
_load_state_dict_into_model(model, d, "led.")
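Something along these lines (a rough sketch that wraps the suggested block with a before/after comparison; it assumes model is the already-loaded LEDForConditionalGeneration) could serve as that check:
import torch
from transformers.modeling_utils import _load_state_dict_into_model

# snapshot the <s> (token id 0) row of the decoder token embedding before the perturbation
before = model.led.decoder.embed_tokens.weight[0].detach().clone()

d = model.state_dict()
d["led.decoder.embed_tokens.weight"][0] = d["led.decoder.embed_tokens.weight"][0] + torch.randn(1024)
_load_state_dict_into_model(model, d, "led.")

after = model.led.decoder.embed_tokens.weight[0].detach()
print("bos embedding changed:", not torch.allclose(before, after))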
I didn't keep the training log - I tried the fix with training up to around 2K (or maybe 3K) steps, and didn't see this </s><s><s>... anymore (whereas without the fix, it did occur as you described).
Once you have the code (with the fixes mentioned above that you will add), we can see if there is some mistake. And if you still get </s><s><s>..., I will try to run it myself. (BTW, I won't be available next week.)
Hi @ydshieh, I have created a branch with the fixes at https://github.com/ratishsp/transformers-fix. I trained two models, LED-Base and LED-Large, with identical code. The training commands are the same as given earlier in the thread: https://github.com/huggingface/transformers/issues/18190#issue-1308379298. The tensorboard logs below show that the issue still exists.
Thanks @ratishsp . Will take a look once I am back!
Hi @ratishsp
As promised, I checked. You are right: perturbing the bos token embedding does not help for the checkpoint allenai/led-large-16384 (well, it helps a bit in the first few iterations, but as the steps continue, we get the same </s><s><s>).
I ran out of ideas; the only thing that works is to avoid using </s> <s> <tok_1> <tok_2> ... when preparing the labels. Instead, just use </s> <tok_1> <tok_2> .... To do so, add the following block after the line
https://github.com/huggingface/transformers/blob/4dd784c32f76fb8285f205b94e2a6ebde731a1cd/examples/pytorch/summarization/run_summarization.py#L536
# Originally, the `labels` are of the form: </s> <s> ..., which causes trouble for finetuning some checkpoints.
# Let's try to remove <s> (`bos` token) in `labels`, i.e. keep only the decoder_start_token (here </s>).
model_inputs["labels"] = [x[1:] for x in model_inputs["labels"]]
Or you can simply use my branch debug_led_large_bad_generation - it will save the generations after each evaluation.
You can verify the effect with (and without) this change by running a tiny training run (with very few examples) as below:
./run_summarization.py \
--model_name_or_path allenai/led-large-16384 \
--dataset_name xsum \
--output_dir ./led-large-16384-xsum-no-bos-dummy-1 \
--overwrite_output_dir \
--logging_dir ./led-large-16384-xsum-no-bos-dummy-logs-1 \
--do_train \
--do_eval \
--predict_with_generate \
--report_to tensorboard \
--load_best_model_at_end \
--greater_is_better True \
--metric_for_best_model rougeL \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=4 \
--evaluation_strategy steps \
--max_steps 500 \
--max_train_samples 500 \
--max_eval_samples 100 \
--logging_steps 100 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 10 \
--generation_max_length 128 \
--num_beams 3
Let me know if you can get normal results with this change 🙏 Thank you!
Hi @ydshieh, it works! Thanks.
@ratishsp I am super glad it also works for you 🤗 !
I will discuss with my colleagues where to put this information in our documentation, so there will be a clearer reference to this issue and the workaround.
Hi @ydshieh,
I'm facing the same problem but with another model; here is a link to the issue.
The finetuning works fine and the loss is decreasing as expected, but the model doesn't generate any sequences. Is there a way to modify the generation logic to get something out, without re-finetuning the model?
I answered in your post on the forum, but just in case:
Hi @customer101, could you provide a script (or the command you used to launch the training) that could reproduce the issue, please?
If you want to proceed on GitHub, it's better to open a new issue instead of continuing in this thread. Thank you.
System Info
transformers version: 4.20.0.dev0

Who can help?
@ydshieh

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
The logs show that at checkpoint 1800 the rouge becomes zero.
{'eval_loss': 2.172360897064209, 'eval_rouge1': 0.0, 'eval_rouge2': 0.0, 'eval_rougeL': 0.0, 'eval_rougeLsum': 0.0, 'eval_gen_len': 20.0, 'eval_runtime': 10.2823, 'eval_samples_per_second': 9.725, 'eval_steps_per_second': 2.431, 'epoch': 0.04}
I evaluate the model output using the below function:
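(The function itself is not reproduced here; a minimal sketch of such an evaluation helper, reconstructed from the global_attention_mask lines quoted earlier in this thread and not the original code, might look like this:)
import torch

def summarize(article, model, tokenizer, max_length=128):
    inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024).input_ids
    global_attention_mask = torch.zeros_like(inputs)
    global_attention_mask[:, 0] = 1  # global attention on the first token <s>
    summary_ids = model.generate(
        inputs, global_attention_mask=global_attention_mask, max_length=max_length
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=False)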
It produces the output
</s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>
Expected behavior
The model should produce the summary of the news article.