Closed norabelrose closed 1 year ago
@fxmarty I think you are more familiar with this topic? If so, could you take a look, thanks!
Hi @norabelrose, would you like to submit a PR?
I already did! 😊 See #24941.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

transformers version: 4.32.0.dev0

Who can help?
@ArthurZucker @younesbelkada
Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The output is too long to fit in a comment, so you'll have to run the code yourself. It features

"debug_backend() called with FX graph:"

being printed several times, each time followed by a fragment of the whole computation graph. This is not expected, since GPT NeoX has no data-dependent control flow.

Expected behavior
The torch.compile backend should only be called once, and therefore "debug_backend() called with FX graph:" should only appear once, because GPT NeoX does not actually require any data-dependent control flow.

I've already checked that this can be fixed by turning GPTNeoXAttention.norm_factor into a Python scalar instead of a tensor. This is actually what torch.baddbmm expects for its alpha parameter; it's supposed to be a scalar, but it seems to silently convert tensors into scalars, so this doesn't cause a crash in normal use.

The exact fix is, in modeling_gpt_neox.py, to replace lines 103-107 so that norm_factor is computed as a plain Python float, and to pass it directly as the alpha argument in the baddbmm call inside _attn.
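The exact replacement code was not preserved here, but as a hedged sketch of the idea (a simplified stand-in class, not the actual upstream diff), the change amounts to storing norm_factor as a Python float and passing it straight through as baddbmm's scalar alpha:

```python
import torch

class ToyNeoXAttention(torch.nn.Module):
    """Simplified stand-in for GPTNeoXAttention, illustrating the fix.
    Names and shapes are illustrative, not the exact upstream code."""

    def __init__(self, head_size: int = 64):
        super().__init__()
        # Before (roughly): a 0-dim tensor,
        #   torch.sqrt(torch.tensor(head_size, dtype=torch.float32))
        # After: a plain Python float, which is what baddbmm's `alpha`
        # actually expects, and which Dynamo can bake into the graph.
        self.norm_factor = head_size ** -0.5

    def _attn(self, query: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
        batch, q_len, _ = query.shape
        attn_scores = torch.zeros(
            batch, q_len, key.size(1), dtype=query.dtype, device=query.device
        )
        # Passing a scalar `alpha` avoids the tensor-to-scalar conversion
        # that forces torch.compile to break the graph.
        return torch.baddbmm(
            attn_scores,
            query,
            key.transpose(1, 2),
            beta=1.0,
            alpha=self.norm_factor,
        )
```

Note that in eager mode baddbmm also silently accepts a 0-dim tensor for alpha, which is why the original tensor-valued norm_factor never crashed; the cost only shows up as graph breaks under torch.compile.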