I have encountered the same problem, and I have also found that bart-large cannot be fine-tuned to produce reasonable output.
@patil-suraj - we had another issue thread about this somewhere no? I can't find it anymore though :-/
@patrickvonplaten It might be this one that I linked in the original description: https://github.com/huggingface/transformers/issues/8005
Found the original issue: https://github.com/huggingface/transformers/issues/9731 . Looking a bit into the commit history here: https://huggingface.co/facebook/bart-large/commits/main it looks like the mask token problem actually only existed for bart-base
and not for bart-large
according to @patil-suraj .
@patil-suraj - could you double-check this real quick?
I can confirm that bart-large generates very strange output.
from transformers import BartTokenizer, FlaxBartForConditionalGeneration

model_name = "facebook/bart-base"  # or "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = FlaxBartForConditionalGeneration.from_pretrained(model_name)
model.params = model.to_bf16(model.params)  # cast parameters to bfloat16 (for TPU)
sentences = (
'She waded to the bank and picked up her shoes and stockings.',
'The bank is increasing the amount they lend to small companies.',
)
inputs = tokenizer(sentences, padding=True, return_tensors='jax')
output = model.generate(inputs.input_ids)
print(tokenizer.batch_decode(output.sequences, skip_special_tokens=True, clean_up_tokenization_spaces=False))
facebook/bart-base:
['She waded to the bank and picked up her shoes and stockings.', 'The bank is increasing the amount they lend to small companies.']

facebook/bart-large:
['She.....', 'TheThe']
To me this seems to be unrelated to #9731; @patrickvonplaten's previous method to check whether the correct mask-token is used, produces a difference of 0 when used with the large model. So it looks like this is not related to the mask-token, as @ayaka14732's example does not even use masks.
I see sorry you're right - I looked too quickly indeed. Will take a deeper look in the coming days.
Any updates?
Hey @StephAO,

You are passing forced_bos_token_id to the tokenizer instead of the model. It should be passed to the model. When running this code:
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
batch = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
print(tokenizer.decode(generated_ids[0]))
gives sensible outputs:
</s><s>My friends are good people, but they eat too many carbs.</s>
Where did you see code that forced_bos_token_id should be passed to the tokenizer?
Good catch. I am actually unsure where/if I saw code that passed forced_bos_token_id to the tokenizer; it is possible that when I was playing around with the model to try to get it to work, I ended up adding the argument to the wrong spot. That being said, the documentation is unclear in a few places:

- The mask-filling example still uses force_bos_token_to_be_generated=True instead of forced_bos_token_id=0 (which I assume is the old version of this argument?).
- The BART docs don't mention forced_bos_token_id as a possible argument. In fact, the only place I can find it defined is in GenerationMixin, and even there it is not clear what this argument does.
- Lastly, it is still confusing to me why this is required for bart-large, but not for bart-base.

Either way, thank you for the update!
Thanks for the great feedback! Actually, if you would be interested, it would be amazing to open a PR to fix the docs - both the mask-filling example and the implementation notes :-)
Otherwise I'm happy to open a PR for it as well!
@patrickvonplaten Thanks for the explanation, it helps a lot! Actually, I think it does not only affect mask infilling, but also the fine-tuning procedure (at least on my side). Without forced_bos_token_id, the bart-large model cannot be fine-tuned well. Therefore, I recommend setting a default value for forced_bos_token_id in the BART model config file if you do not mind.
Hmm, the problem is that I'm not sure whether this should be done for bart-base and bart-large. E.g. when the model was added, the author explicitly mentioned that forced_bos_token_id should only be used for bart-large-cnn by default - see: https://github.com/huggingface/transformers/blob/818878dc881a32c949844f734af5a8ce25385660/src/transformers/configuration_bart.py#L109

This is also why only bart-large-cnn has this config attribute set by default -> see: https://huggingface.co/facebook/bart-large-cnn/blob/main/config.json#L27 - the others don't.
@sshleifer sorry to ping you here - do you remember by any chance if forced_bos_token_id is recommended to be used when fine-tuning BART? E.g. should one always place BOS after decoder_start_token_id for fine-tuning?

Also cc @patil-suraj - any ideas?
I'm not sure whether this should be done for bart-base and bart-large
Sam will have a better answer, but IIRC the forced_bos_token_id (previously force_bos_token_to_be_generated) was added to be able to reproduce the bart-large-cnn results. And in our experiments, we had found that this is only required for the cnn pre-trained model; other checkpoints were not affected by this. Found a related discussion and PR.
do you remember by any chance if forced_bos_token_id is recommended to be used when fine-tuning BART? E.g. should one always place BOS after decoder_start_token_id for fine-tuning?
In my experiments, it actually doesn't matter. It depends on how the decoder_input_ids are prepared; for example, if the decoder_input_ids look like [eos, bos, .....], then BOS should be forced.
But now the issue is that in BART-like models the decoder_input_ids are prepared by calling the shift_tokens_right function on the labels if decoder_input_ids are not passed. This is how it's done in our summarization and translation fine-tuning examples. The decoder_input_ids are prepared in DataCollatorForSeq2Seq by calling model.prepare_decoder_input_ids_from_labels, which then calls shift_tokens_right: https://github.com/huggingface/transformers/blob/05c237ea94e08786abbac6c6185cfdfa262a8c53/src/transformers/data/data_collator.py#L600

And this will always add bos after eos. See:
from transformers import BartTokenizer
from transformers.models.bart.modeling_bart import shift_tokens_right
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
labels = tokenizer("This is a test", return_tensors="pt").input_ids
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id, tokenizer.eos_token_id)
decoder_input_ids
# tensor([[ 2, 0, 713, 16, 10, 1296]])
tokenizer.batch_decode(decoder_input_ids)
# ['</s><s>This is a test']
but forced_bos_token_id is not set by default. So this might affect generations for models trained using these scripts.
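As a workaround until the configs are updated, the id can also be passed to generate() directly, which overrides whatever is (not) set in the config. A minimal sketch (shown with bart-base, but the same applies to any checkpoint fine-tuned with these scripts):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
# forced_bos_token_id passed here takes precedence over the config value for this call
generated = model.generate(inputs.input_ids, forced_bos_token_id=model.config.bos_token_id)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))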
Thanks for the summary @patil-suraj - that's super helpful!

So as I understand it, the BartTokenizer will always add the BOS token to the beginning. E.g.:
from transformers import BartTokenizer
tok = BartTokenizer.from_pretrained("facebook/bart-large")
print(tok.decode(tok("hello").input_ids))
Out[4]: '<s>hello</s>'
This means that for Seq2Seq, if someone follows any of our official examples (both accelerate and Trainer), the labels are created to be:

<s> label text </s>

with the decoder input ids then being (since they are the labels shifted to the right):

<decoder-start-token-id><s> label text </s>
Now, we know from Sam's comment here that it is not necessary to add the BOS token (<s>) for successful fine-tuning. One also gets good results when not adding the BOS token - i.e. it doesn't really make a difference.

But the problem now is that while we quietly add BOS to the labels in all of our example scripts for fine-tuning because of BartTokenizer's behavior, we don't "force-add" the BOS token when evaluating the model with generate, because forced_bos_token_id is not set by default in the config. This means that while the model is well trained when using the examples, the evaluation results are probably not as good as they could be if we forced the BOS token to be generated.
On the other hand, one could also argue that the model should learn to always generate <BOS> as the first token, so that forcing it is not needed. However, we know that results are better if we force <BOS> to be generated when the model has been trained as explained above.
As a conclusion, @patil-suraj and I think we should actually add forced_bos_token_id=0 by default to the pretrained BART models bart-large and bart-base. This would be a breaking change for people who use bart-large by default for mask filling with generate, but as seen above it should improve results.

Since those two checkpoints are highly used, keen to hear your opinion @LysandreJik @sgugger here.
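For illustration, a short sketch of what this change would mean in practice (assuming the Hub config for facebook/bart-large gains forced_bos_token_id: 0; anyone who prefers the previous behavior can still clear the attribute locally):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

batch = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
# With forced_bos_token_id in the config, generate() forces <s> automatically:
print(tokenizer.decode(model.generate(batch["input_ids"])[0]))

# Opting out of the new default is still possible by clearing the attribute:
model.config.forced_bos_token_id = None
print(tokenizer.decode(model.generate(batch["input_ids"])[0]))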
Are we 100% sure that the change would only make predictions better in all circumstances?
If someone fine-tunes with a BOS as the second token (behind decoder_start_token_id) - which is done in all example scripts - then yes, the change would make predictions better in all circumstances for the fine-tuned model.
I'm not talking about fine-tuning, I'm talking about users relying on this model in production right now.
This is a pretrained checkpoint - so I highly doubt anybody uses this model + generate() in production. The only use case is to generate a <mask> token as shown in the issue description. It doesn't make sense to use this in production (why use expensive generate for a <mask> token instead of BERT?). And even for this use case the change works better (however, this is hard to test).
Can this problem cause a model pretrained from scratch to show poor loss and accuracy scores? I already pre-trained a model with TensorFlow Keras and BART, and it shows a good logit accuracy (~80%) at base scale. However, I have been struggling for months to use BART large with the same experimental setup. The logit accuracy for BART large never goes beyond ~4%. I tried everything, including decreasing and increasing the batch size, learning rate, etc., but with no luck.
@salrowili Yeah, I can confirm it will. Compared with bart-base, the predictions of bart-large seem to have some randomness, even in an experiment which evaluates whether it can converge on a small toy dataset.
Did you manage to fix it with the forced_bos_token_id=0 solution?
According to my efforts so far, I can observe the following facts:

- With forced_bos_token_id=0 set, the model bart-large can be fine-tuned to overfit a small toy dataset, while the loss is still abnormal.
- On a real dataset, WikiSQL, the model bart-large shows promising fine-tuning results on some experimental runs, e.g., performance comparable with the model fine-tuned with fairseq. However, some experimental runs fail when switching the fine-tuning environment (e.g., from single-card to multi-card). This often comes with a very long evaluation time, since the model keeps producing a sequence such as </s> <s> <s> <s> <s> ... until it reaches val_max_target_length. These failure predictions are all empty, and these errors may not be solved by the forced_bos_token_id=0 solution.

Note: <s> corresponds to the bos_token_id, and </s> corresponds to both the decoder_start_token_id and the eos_token_id.

One failure case is shown below, where the denotation_accuracy shows a BIG jump between fine-tuning steps:
Thanks for sharing these interesting findings. I have also conducted a small experiment involving fine-tuning SQuAD with both BART-base and BART-large with PyTorch XLA on a TPU-8 unit on Google Colab (attached). It is based on this colab: https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb. Although I use Google Colab Pro, which has more memory (35GB), the free Google Colab gives a TPU with 25GB, so anyone interested in replicating this experiment can do it; you may just need to reduce the value of per_device_train_batch_size to use less memory (per_device_train_batch_size * 8 = total batch size). If you run into an out-of-memory error, reduce per_device_train_batch_size. If you get a SIG error, restart the Colab and re-run all the cells needed to restore variables and import packages. At the beginning the training will be slow since XLA compilation needs extra time, especially for bart-large (~10 min). The largest batch size is 16 for bart-large (per_device_train_batch_size=2).

I ran both large and base for one epoch. The loss scores are very close, and it seems fine-tuning BART-large is not affected by this issue. I tried to do prediction with BART-large but got an OOM error upon saving the model. However, I needed to add skip_special_tokens=True to the tokenizer.decode function to get rid of <s> and pad tokens in BART-base. I am also still worried about pre-training BART large from scratch because it involves a lot of resources and uses the [mask] token.

Comparing_BART_base_vs_BART_large_on_TPU.zip
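For reference, a tiny sketch of the decoding tweak mentioned above, so that <s>, </s> and <pad> do not show up in the decoded predictions:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
ids = tokenizer("a test answer", padding="max_length", max_length=8).input_ids
print(tokenizer.decode(ids))                            # '<s>a test answer</s><pad><pad><pad>'
print(tokenizer.decode(ids, skip_special_tokens=True))  # 'a test answer'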
@salrowili Thanks for sharing! I think currently this bug affects all NLG tasks (e.g., summarization), but not NLU tasks (e.g., classification and extractive machine reading comprehension). I have some ideas about why that is, and would like to share them here later once they are confirmed.
In #8005 it was mentioned that the forced BOS-token is correct for mask-infilling, but may be suboptimal for other tasks.
At least for generation tasks the forced BOS token seems necessary, right? Maybe the solution could be to implement the originally proposed approach of adding the forced BOS token as task_specific_params for all generation tasks, although I'm not sure how these parameters are used.
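For illustration only, a rough sketch of how such task-specific parameters could be stored and applied by hand (the task key "mask_filling" is made up here, and the way pipelines actually consume task_specific_params may differ):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
# Hypothetical task entry carrying the forced BOS token for generation tasks
model.config.task_specific_params = {
    "mask_filling": {"forced_bos_token_id": model.config.bos_token_id},
}
# Apply the overrides explicitly before calling generate()
model.config.update(model.config.task_specific_params["mask_filling"])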
Maybe alternatively forced_bos_token_id=-1 could be used as a new default in the configs for all BART models; each BART subclass could then set its preferred default whenever -1 is set. That would be forced_bos_token_id = config.bos_token_id for BartForConditionalGeneration and forced_bos_token_id = None for all others (?) (provided all others benefit from the lack of a forced BOS token). Although that might be confusing, as the config's value of -1 is never actually used.
As a side note: the Implementation notes for BART in general seem at least partially outdated. fairseq.encode is not used by the tokenizer at all. Also, prepending a space to any text passed to the tokenizer actually makes generations worse, even if the BOS token is forced:
>>> tokenizer.batch_decode(model.generate(tokenizer("hello", return_tensors="pt").input_ids))
['</s><s>hello</s>']
>>> tokenizer.batch_decode(model.generate(tokenizer(" hello", return_tensors="pt").input_ids))
['</s><s></s>']
@patrickvonplaten I find that note misleading and would advocate for it to be removed/clarified.
Hi guys, after being confused by this question for months, I eventually found the root issue and the fundamental solution to the failure of bart-large:

- The root issue: the ill-conditioned representation of the bos_token (i.e., the token 0) in the bart-large pre-trained model.
- The fundamental solution: perturb the bos_token representation.

BART (Lewis et al. 2020) is a popular pre-trained encoder-decoder model for both Natural Language Understanding (e.g., sequence classification) and Natural Language Generation (e.g., text summarization) tasks. However, there are two known issues (#8005, #15559) for the BART family:
- In mask filling, BART-Base's output is as expected (e.g., </s><s>My friends are healthy, but they eat too many carbs.</s>), while BART-Large's output (e.g., </s>My,, but they eat too many carbs.</s>) is wonky.
- BART-Large cannot be fine-tuned to produce reasonable output (this issue).

This answer provides a solution: setting config.force_bos_token_to_be_generated to True could encourage BART-Large to behave as expected in the mask-filling task.
The corresponding latest solution (for v4.17.0) is:
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
However, such a setting cannot solve the issue encountered in the fine-tuning procedure perfectly. In my preliminary study, the predictions on the small toy task can be perfect when setting forced_bos_token_id=0, but not the loss. The loss also jumps between different evaluation steps, which confused me a lot. After digging into the model predictions for days, I finally found that the most important thing in this tricky bug may be the bos_token (i.e., the token <s>, or token id 0). There are always two strange phenomena corresponding to the bos_token:
- When decoder_start_token is fed to the decoder, it is possible that the model does not predict bos_token, even after thousands of training steps. This is very strange, since the only correct output for decoder_start_token is bos_token: during data preparation, the BART model always prefixes each output with the bos_token (details see here).
- The model sometimes keeps producing the sequence </s> <s> <s> <s> <s> ... (i.e., token ids 2 0 0 0 0 ...) until it reaches the maximum target length.
) until it arrives the end of maximum target length.After setting forced_bos_token_id=0
, it forces the model to predict bos_token
when decoder_start_token
is fed to the decoder, and it can fix the first case, but not for the second case!
As stated above, the strangest part of this bug is that BART-Large does not act the same as BART-Base on both the mask-filling task and the fine-tuning procedure. However, after carefully checking and confirming that their configs are the same, I believe there must be something "wrong" with BART-Large's pre-trained model weights.

Then I print the norm of each token embedding for BART:
from transformers import BartModel, BartTokenizer
import torch
model = BartModel.from_pretrained("facebook/bart-base")
print(model.shared.weight.norm(dim=1))
# tensor([1.7076, 0.7572, 1.6171, ..., 0.9508, 0.9553, 1.6455], grad_fn=<NormBackward3>)
model = BartModel.from_pretrained("facebook/bart-large")
print(model.shared.weight.norm(dim=1))
# tensor([6.9603, 0.4269, 2.4219, ..., 1.3628, 1.3857, 2.4098], grad_fn=<NormBackward3>)
As bos_token (token 0) and eos_token (token 2) both appear in all training sentences, they should have a similar embedding norm. For BART-Base, the norm of token 0 is indeed close to that of token 2, but not for BART-Large! For BART-Large, the norm of token 0 is as high as 6.96, nearly three times the norm of token 2. Therefore, I think the fine-tuning failure is caused by the ill-conditioned representation of the bos_token in BART-Large, at least for the current optimizer in 🤗 transformers.
Then, I tried a simple solution which only adds some noise to the bos_token representation, as below:
import torch
# please download the model in your local directory and manually change the weight
state_dict = torch.load("bart-large/pytorch_model.bin")
state_dict["model.shared.weight"][0] = state_dict["model.shared.weight"][0] + torch.randn(1024)
torch.save(state_dict, open("bart-large/pytorch_model.bin", "wb"))
The above solution (perturbing the bos_token representation) works! On both toy benchmarks and real datasets, the loss during evaluation becomes normal, and the performance after fine-tuning BART-Large is as expected. Most importantly, even without setting forced_bos_token_id=0, the output of fine-tuned BART-Large models is satisfactory.
In summary, I recommend perturbing the representation of the bos_token to obtain good fine-tuning performance of BART-Large on NLG tasks. An in-memory variant of the perturbation is sketched below.
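For convenience, an equivalent sketch that applies the same perturbation in memory and re-saves the checkpoint with save_pretrained, instead of editing pytorch_model.bin by hand (the output directory name is arbitrary):

import torch
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
with torch.no_grad():
    # add noise to the <s> (token id 0) embedding; the tied lm_head shares this weight
    model.model.shared.weight[0] += torch.randn(model.config.d_model)
model.save_pretrained("bart-large-perturbed-bos")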
Thanks! I am still curious about:

- Why does this issue happen with bart-large but not bart-base?
- Is this issue only related to the Hugging Face model, or does it affect the model in the original Facebook repository as well?
Thank you @SivilTaram for this detailed explanation. Wonderful effort.

Could this problem be related to mBART? mBART has only a large-scale version, not a base scale, and I think the first token is designed for the language code. See https://github.com/huggingface/transformers/pull/9811
Thanks a lot for all the work guys! I think at this point we can only add forced_bos_token_id=0 to the config of the pretrained checkpoints as discussed previously - @patil-suraj and I just did it here:

- https://huggingface.co/facebook/bart-large/commit/030bb1bda8b56e9a918a6f3e764cdfeffa781ccc
- https://huggingface.co/facebook/bart-base/commit/d0af9887e98da87934e6ecaf42e724f1684bb72b

Also maybe we can ask the official authors what they think about your findings. Gently pinging @ngoyal2707 here
Yes, I agree that my finding on perturbing the bos_token cannot be a default option for bart-large, since it would affect a lot of users - at least until we finally figure out why the perturbation works.
Thanks! I am still curious about:
- Why does this issue happen with bart-large but not bart-base?
- Is this issue only related to the Hugging Face model, or does it affect the model in the original Facebook repository as well?

1. I do not know what is special about bart-large or bart-base, but the fact is that the token bos_token makes bart-large hard to optimize in the context of 🤗 transformers.
2. So far I have fine-tuned bart-large related models on some NLG datasets with fairseq. If it is finally an optimization issue, I think we may encounter the same issue in fairseq after trying enough random seeds. However, everything in fairseq under the default random seed seems okay for me by now.

Thank you @SivilTaram for this detailed explanation. Wonderful effort. Could this problem be related to mBART? mBART has only a large-scale version, not a base scale, and I think the first token is designed for the language code. See #9811
Good point @salrowili! I have no idea for now, but I agree that the first token is originally designed to serve mBART rather than BART.
What is your loss score for both (not EM)? I am curious to know.
This might have been a false alarm, let me triple check
Yeah false alarm my bad folks.
Actually, right now, with the same transformers BartForConditionalGeneration instance, AllenAI's beam search gives me near 100% EM and Hugging Face's gives me very low EM, with a much lower per-token accuracy (~65%). Both with num_beams of 4 and a max length of 500; trying to figure out the difference...
Ok, found the cause of the difference: the Hugging Face generate reads a no_repeat_ngram_size of 3 from the bart config, and the AllenAI decoder does not look at it.
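For anyone comparing the two decoders, a small sketch of neutralizing that setting on the Hugging Face side; passing no_repeat_ngram_size=0 overrides the value of 3 stored in the config:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
input_ids = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt").input_ids

# no_repeat_ngram_size=0 disables the n-gram blocking read from the config,
# making generate() comparable to a decoder that ignores that option
generated = model.generate(input_ids, num_beams=4, max_length=500, no_repeat_ngram_size=0)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))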
Can also confirm that gradients of bart-large are generally very high, which might lead to unexpected behavior. See: https://discuss.huggingface.co/t/gradients-verification-between-jax-flax-models-and-pytorch/15970
Good job! This reinforces my belief that it is bart-large itself that is sensitive to optimization, and not a bug in the codebase.
We just found out that bart-large has its weights accidentally stored in fp16 on the Hub, see: https://github.com/huggingface/transformers/issues/16736 - this might be a reason for the behavior here.
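A quick sketch for checking the storage dtype of the published checkpoint (assuming huggingface_hub is installed; torch.float16 here would confirm the report):

import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("facebook/bart-large", "pytorch_model.bin")
state_dict = torch.load(path, map_location="cpu")
print(state_dict["model.shared.weight"].dtype)  # torch.float16 vs. torch.float32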
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
There is a similar issue with led-large, which has the same weights as bart-large. See my comment here.

In short, no matter what the encoder input sequences are, the decoder sequence [</s>, <s>] will produce identical (differences are in the range ~1e-6) LM logits (before any fine-tuning). This explains why we have training trouble, as well as why perturbing the bos_token makes sense. However, it's not clear why this happens.
I am not fully convinced by @SivilTaram's reasoning on the norms of <eos> and <bos>: if they have a larger difference in norm for bart-large, their input embeddings are more different, yet the output LM logits end up nearly identical. This looks very strange.
I have verified that the same situation occurs for bart-large using Hugging Face's transformers, as sketched below.
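A minimal sketch of that check (any encoder input works; the reported differences are in the ~1e-6 range):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()

inputs = tokenizer("The bank is increasing the amount they lend to small companies.", return_tensors="pt")
decoder_input_ids = torch.tensor([[2, 0]])  # [</s>, <s>]
with torch.no_grad():
    logits = model(input_ids=inputs.input_ids, decoder_input_ids=decoder_input_ids).logits
print((logits[0, 0] - logits[0, 1]).abs().max())  # nearly identical for bart-large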
Furthermore, the following code snippet confirms the same for fairseq's bart-large.
import torch

bart = torch.hub.load('pytorch/fairseq', 'bart.large')
bart.eval()  # disable dropout (or leave in train mode to finetune)
model = bart.model
#print(model.decoder.embed_tokens.weight[0][:16])
#print(model.decoder.embed_tokens.weight[2][:16])

for _ in range(20):
    # random encoder input sequences with random length
    seq_len = torch.randint(low=8, high=64, size=(1,))[0]
    src_tokens = torch.randint(low=0, high=50265, size=(1, seq_len), dtype=torch.int32)
    src_tokens = torch.cat([torch.tensor([[0]], dtype=torch.int32), src_tokens, torch.tensor([[2]], dtype=torch.int32)], dim=-1)
    src_lengths = seq_len + 2
    prev_output_tokens = torch.tensor([[2, 0]], dtype=torch.int32)
    o = model(
        src_tokens=src_tokens,
        src_lengths=src_lengths,
        prev_output_tokens=prev_output_tokens,
    )
    print(o[0].shape)
    print(o[0][0, :2, :16])
The results:
tensor([[11.3236, -0.9138, 6.5245, -0.3914, 9.7808, 2.2912, 8.1962, 1.9227,
4.7425, 3.1021, 2.6246, 3.7791, 9.5079, 2.6686, 2.1119, 2.0344],
[11.3236, -0.9138, 6.5245, -0.3914, 9.7808, 2.2912, 8.1962, 1.9227,
4.7425, 3.1021, 2.6246, 3.7791, 9.5079, 2.6686, 2.1119, 2.0344]],
grad_fn=<SliceBackward0>)
torch.Size([1, 2, 50265])
tensor([[10.7945, -1.0372, 7.0184, -0.3236, 10.4561, 2.6929, 8.5290, 3.1237,
5.1213, 3.5141, 2.8864, 3.9884, 10.1654, 3.6481, 1.8212, 2.2971],
[10.7945, -1.0372, 7.0184, -0.3236, 10.4561, 2.6929, 8.5290, 3.1237,
5.1213, 3.5141, 2.8864, 3.9884, 10.1654, 3.6481, 1.8212, 2.2971]],
grad_fn=<SliceBackward0>)
torch.Size([1, 2, 50265])
tensor([[ 9.5601, -1.0508, 6.5913, -1.1130, 10.1133, 2.0899, 8.3389, 2.9828,
4.7401, 3.2962, 1.6522, 3.3447, 10.1464, 3.2324, 2.4950, 1.9132],
[ 9.5601, -1.0508, 6.5913, -1.1130, 10.1133, 2.0899, 8.3389, 2.9828,
4.7401, 3.2962, 1.6522, 3.3447, 10.1464, 3.2324, 2.4950, 1.9132]],
grad_fn=<SliceBackward0>)
torch.Size([1, 2, 50265])
tensor([[13.9500, -1.1926, 8.6385, -1.3860, 10.3129, 2.6660, 9.1016, 3.1353,
5.2579, 3.4807, 2.5914, 3.9124, 11.1577, 3.0453, 1.7675, 2.6812],
[13.9500, -1.1926, 8.6385, -1.3860, 10.3129, 2.6660, 9.1016, 3.1353,
5.2579, 3.4807, 2.5914, 3.9124, 11.1577, 3.0453, 1.7675, 2.6812]],
grad_fn=<SliceBackward0>)
torch.Size([1, 2, 50265])
tensor([[16.2144, -0.7714, 9.2828, 0.7236, 11.1654, 1.8226, 9.0874, 2.7226,
5.5879, 3.4060, 1.7666, 4.0211, 11.2787, 2.9147, 1.6007, 1.7492],
[16.2144, -0.7714, 9.2828, 0.7236, 11.1654, 1.8226, 9.0874, 2.7226,
5.5879, 3.4060, 1.7666, 4.0211, 11.2787, 2.9147, 1.6007, 1.7492]],
grad_fn=<SliceBackward0>)
For the record: bart-large seems to have learned to predict the first token after <s> in the encoder input sequence for both of the first two decoder tokens [</s>, <s>]. I provide a script to confirm this at the end.
Here are some outputs
example idx: 1
max diff in lm logits: 1.621246337890625e-05
--------------------
predicted token ids: [1640, 1640, 13989, 212, 4038]
predicted tokens: ['(', '(', 'ĠMLS', 'th', 'Ġanniversary']
document tokens: ['<s>', '(', 'CNN', ')', 'On', 'Ġthe', 'Ġ6', 'th', 'Ġof', 'ĠApril', 'Ġ1996', ',', 'ĠSan', 'ĠJose', 'ĠClash', 'Ġand']
========================================
example idx: 10
max diff in lm logits: 7.033348083496094e-06
--------------------
predicted token ids: [11770, 11770, 16, 6308, 5678]
predicted tokens: ['March', 'March', 'Ġis', 'Ġcontains', 'Ġlinks']
document tokens: ['<s>', 'March', 'Ġ10', ',', 'Ġ2015', 'Ġ.', 'ĠWe', "'re", 'Ġtruly', 'Ġinternational', 'Ġin', 'Ġscope', 'Ġon', 'ĠTuesday', '.', 'ĠWe']
========================================
example idx: 20
max diff in lm logits: 7.62939453125e-06
--------------------
predicted token ids: [41650, 41650, 11, 1429, 224]
predicted tokens: ['Tok', 'Tok', 'Ġin', 'ĠJapan', 'Ġsay']
document tokens: ['<s>', 'Tok', 'yo', 'Ġ(', 'CNN', ')', 'Police', 'Ġin', 'ĠJapan', 'Ġsay', 'Ġthey', 'Ġhave', 'Ġarrested', 'Ġa', 'Ġ40', '-']
========================================
example idx: 24
max diff in lm logits: 9.5367431640625e-06
--------------------
predicted token ids: [23122, 23122, 52, 9, 10]
predicted tokens: ['London', 'London', 'Ġwe', 'Ġof', 'Ġa']
document tokens: ['<s>', 'London', 'Ġ(', 'CNN', ')', 'A', 'Ġphoto', 'Ġof', 'Ġa', 'Ġwe', 'asel', 'Ġh', 'itching', 'Ġa', 'Ġsurprise', 'Ġlift']
========================================
example idx: 36
max diff in lm logits: 8.106231689453125e-06
--------------------
predicted token ids: [4030, 4030, 2238, 18, 7396]
predicted tokens: ['New', 'New', 'ĠKorean', "'s", 'izes']
document tokens: ['<s>', 'New', 'ĠDelhi', ',', 'ĠIndia', 'Ġ(', 'CNN', ')', 'The', 'ĠNorth', 'ĠKorean', 'Ġambassador', 'Ġin', 'ĠBangladesh', 'Ġissued', 'Ġan']
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import datasets
summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
}
def get_dataset(dataset_name, dataset_config, tokenizer, n_samples):
    max_source_length = 1024
    max_target_length = 128
    padding = True
    ignore_pad_token_for_loss = True
    padding = "max_length"
    prefix = ""
    max_train_samples = n_samples
    max_eval_samples = n_samples
    preprocessing_num_workers = 8

    raw_datasets = datasets.load_dataset(dataset_name, dataset_config)
    text_column, summary_column = summarization_name_mapping[dataset_name]

    def foo(x):
        if x == tokenizer.cls_token_id:
            return 1
        elif x == tokenizer.pad_token_id:
            return -1
        else:
            return 0

    def preprocess_function(examples):
        # remove pairs where at least one record is None
        inputs, targets = [], []
        for i in range(len(examples[text_column])):
            if examples[text_column][i] and examples[summary_column][i]:
                inputs.append(examples[text_column][i])
                targets.append(examples[summary_column][i])

        inputs = [prefix + inp for inp in inputs]
        model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

        # Tokenize targets with the `text_target` keyword argument
        labels = tokenizer(text_target=targets, max_length=max_target_length, padding=padding, truncation=True)

        # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
        # padding in the loss.
        if padding == "max_length" and ignore_pad_token_for_loss:
            labels["input_ids"] = [
                [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
            ]

        model_inputs["labels"] = labels["input_ids"]
        if tokenizer.__class__.__name__.startswith("LED"):
            model_inputs["global_attention_mask"] = [[foo(y) for y in x] for x in model_inputs["input_ids"]]

        return model_inputs

    train_dataset = raw_datasets["train"]
    eval_dataset = raw_datasets["validation"]
    train_dataset = train_dataset.select(range(max_train_samples))
    eval_dataset = eval_dataset.select(range(max_eval_samples))

    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=[text_column, summary_column, 'id'],
        desc="Running tokenizer on train dataset",
    )
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=preprocessing_num_workers,
        remove_columns=[text_column, summary_column, 'id'],
        desc="Running tokenizer on validation dataset",
    )

    return train_dataset, eval_dataset
def check_model(train_dataset, eval_dataset, model, tokenizer, n_samples, text_column, summary_column):
    for idx, eval_example in enumerate(eval_dataset):
        input_ids = eval_example["input_ids"]

        decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels=torch.tensor([eval_example["labels"]], dtype=torch.int32))
        decoder_input_ids = decoder_input_ids.numpy().tolist()
        eval_example["decoder_input_ids"] = decoder_input_ids[0]  # remove batch dim
        eval_example.pop("labels")

        decoder_input_ids = eval_example.pop("decoder_input_ids")
        eval_example["decoder_input_ids"] = [2, 0] + decoder_input_ids[2:5]

        for k in eval_example:
            eval_example[k] = torch.tensor([eval_example[k]], dtype=torch.int32)

        output = model(**eval_example)

        print(f"example idx: {idx}")
        print(f'max diff in lm logits: {np.amax(np.abs((output.logits[0, 0] - output.logits[0, 1]).detach().to("cpu").numpy()))}')
        print(f"-" * 20)
        pred_ids = torch.argmax(output.logits, dim=-1).detach().to("cpu").numpy().tolist()[0]
        print(f'predicted token ids: {pred_ids}')
        pred_tokens = tokenizer.convert_ids_to_tokens(pred_ids)
        print(f'predicted tokens: {pred_tokens}')
        document_tokens = tokenizer.convert_ids_to_tokens(input_ids)
        print(f'document tokens: {document_tokens[:16]}')
        print(f"=" * 40)
def run(checkpoint_name, dataset_name, dataset_config=None, n_samples=100):
    text_column, summary_column = summarization_name_mapping[dataset_name]

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_name)

    train_dataset, eval_dataset = get_dataset(dataset_name, dataset_config=dataset_config, tokenizer=tokenizer, n_samples=n_samples)
    check_model(train_dataset, eval_dataset, model, tokenizer, n_samples, text_column, summary_column)


run("facebook/bart-large", "cnn_dailymail", "3.0.0", n_samples=100)
#run("facebook/bart-large", "xsum", None, n_samples=10)
#run("allenai/led-large-16384", "cnn_dailymail", "3.0.0", n_samples=10)
#run("allenai/led-large-16384", "xsum", None, n_samples=10)
I just saw this thread - sorry for all the pain here! The problem was caused by an unfortunate config bug in the original BART-large training run, which caused decoder sequences to start with an extra </s> token. I assume that got fixed in BART-base, which is why it's behaving differently.
Thank you so much for clarifying it @mikelewis0, well appreciated!
I think you misinterpreted the cause. The extra </s> token is intended, and it is also used in bart-base.
Environment info

- transformers version: 4.16.2 (issue exists on 4.9.2)

Who can help

@patrickvonplaten @sshleifer

Information

Essentially re-opening issue #8005: BART-large does not mask-fill properly (whereas BART-base has entirely reasonable outputs). The previous fix of setting force_bos_token_to_be_generated = True is no longer viable, since the option no longer exists in the BART config. It also seems like adjust_logits_during_generation (where force_bos_token_to_be_generated was used) is no longer implemented in the BART model.

To reproduce

Steps to reproduce the behavior:
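A minimal reproduction sketch based on the examples discussed above (run against facebook/bart-large on v4.16.2; the same code with facebook/bart-base fills the mask sensibly):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

batch = tokenizer("My friends are <mask> but they eat too many carbs.", return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
print(tokenizer.decode(generated_ids[0]))  # wonky output for bart-large without forced_bos_token_id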