Hey @abarbet 👋
This issue may arise when beam search, sampling, and long outputs are used together. A potential bug in PyTorch itself compounds it. You can read the full story in this issue.
TL;DR -- my immediate suggestion would be to avoid using `num_beams` and `do_sample` together. If you want to use them both, you'll have to read the issue linked above, which describes the problem and solutions :)
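Not from the thread, but a minimal sketch of what that suggestion looks like in practice, assuming a generic flan-T5 checkpoint (`google/flan-t5-small`) and an arbitrary prompt:

```python
# Sketch (not from the thread): keep num_beams and do_sample in separate generate() calls.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
inputs = tokenizer("Summarize: the quick brown fox jumps over the lazy dog.", return_tensors="pt")

# Option 1: pure beam search (deterministic), no sampling flags
beam_out = model.generate(**inputs, num_beams=4, max_new_tokens=64)

# Option 2: pure sampling, no beams
sample_out = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=64)

print(tokenizer.batch_decode(beam_out, skip_special_tokens=True))
print(tokenizer.batch_decode(sample_out, skip_special_tokens=True))
```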
Ah thank you, that issue is very helpful! Do you have any idea why we would see a similar error in `trlX` training despite not using beam sampling? I know you don't have access to my training script and most likely aren't familiar with their codebase, so this is a complete long shot.
The only thing I can think of, if it's not caused by a sampling bug, is some kind of destructive learning in the PPO step that causes the token distributions to get completely out of whack.
@abarbet It may be due to this PyTorch issue, where the sampling step can pick very low-probability tokens that it shouldn't, which in turn causes computations to derail.
Try running your script with PT 1.x instead of 2.0!
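For context, the exact RuntimeError reported in this thread is the one `torch.multinomial` raises when the probability tensor it receives already contains `inf`/`nan`. A tiny sketch that reproduces the message directly (not the underlying generation bug):

```python
# Sketch (not from the thread): feed torch.multinomial a corrupted distribution
# to see the same error message that generate()'s sampling step surfaces.
import torch

probs = torch.tensor([0.5, float("nan"), 0.5])  # corrupted probability distribution
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # probability tensor contains either `inf`, `nan` or element < 0
```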
For me, this issue also occurs with pytorch 1.13.1 https://github.com/huggingface/transformers/issues/22914#issuecomment-1562034753
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello, has a fix been found for this issue? I'm using the latest version of `transformers` and can confirm that running inference with `model.generate()` and parameters such as `temperature` and `do_sample` causes this issue.
```python
# model, tokenizer, inputs and max_length are defined earlier in my script
# (BRIO model with pre-trained weights from the Hub, see edit2 below)
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=max_length,
    min_length=128,
    temperature=0.1,  # low temperature; together with do_sample this triggers the error
    do_sample=True,
    # top_p=0.3       # sampling with top_p alone works fine, see edit below
)
```
edit: I can confirm now that `do_sample` and `temperature` together are the cause of the issue, as `top_p` alone works fine for me.
edit2: I forgot to mention that the model I'm using is BRIO, loading pre-trained weights from HF.
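For reference, a sketch of the configuration reported as working, assuming the same `model`, `inputs`, and `max_length` as in the snippet above:

```python
# Sketch (not from the thread): nucleus sampling without an explicit temperature,
# which the poster reports as not triggering the error.
summary_ids = model.generate(
    inputs["input_ids"],
    max_length=max_length,
    min_length=128,
    do_sample=True,
    top_p=0.3,
)
```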
@yungsinatra0 The issue should only be gone with the next PT release (i.e. `torch>2.0`)
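A generic check (not from the thread) to see which PyTorch version is installed before deciding whether the fix applies to your environment:

```python
# Print the installed PyTorch version; the comment above says the fix lands in a
# release newer than 2.0.
import torch

print(torch.__version__)
```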
System Info

`transformers` version: 4.27.1

Who can help?

@ArthurZucker @gante

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
This has most recently arisen when using `trlX` to do reinforcement learning on `flan-T5`. I wrote an issue on their own repo, but there seems to be no response, and it is somewhat better suited as an issue in this repo since it has `transformers` code at its core.

The main issue is that `generate` with a seq2seq model, namely `flan-t5`, sometimes produces the following error: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0. This has been well documented in other issues like this one, but the behavior in that issue is more custom than calling `generate` in its standard configuration.

Here is a code example to reproduce:
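(The original snippet is not preserved in this extract; below is a minimal sketch of the setup described, assuming the small flan-T5 checkpoint and an arbitrary prompt, with `do_sample` and `temperature` as in the report.)

```python
# Sketch of the described setup, not the poster's original code: flan-T5 with
# do_sample and temperature, which intermittently raises the probability-tensor error.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")  # assumed checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

inputs = tokenizer("Explain why the sky is blue.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```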
NB: `temperature` seems to be one of the main causes of this issue, as removing this kwarg from the generate call does not produce the error in the above case. However, that is not true of all cases. I have seen the error in my `trlX` training loops with kwargs as simple as `{"max_new_tokens": 512, "do_sample": True, "top_k": 0, "top_p": 1}`. Thus it seems this error is not always related to temperature.

Expected behavior
The expected behavior in this case would be for the sampling to work every time instead of having strange edge cases where tokens are unreachable.