gante opened this issue 2 years ago
@gante do you require any help with this issue? Happy to contribute
Hi @anmolsjoshi 👋
If you are comfortable with debugging XLA, absolutely :) My recommendation would be to pick a model from "Models failing complex tests" (the others might require significant architecture changes) and to start debugging. The number one suspect is always the position embeddings, which may not be handling the case where `past` is padded. Let me know if you are up for it, and which model you would like to take!
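As loose context for that hint, here is a minimal sketch of the mask-based position-id pattern several TF models in transformers use; this is an illustration of the idea, not the library's actual code:

```python
import tensorflow as tf

# Sketch: with XLA, `past` is padded to a fixed length, so position ids
# derived from the past's length drift. Deriving them from the attention
# mask counts only real tokens, so padding in `past` no longer matters.
def positions_from_attention_mask(attention_mask: tf.Tensor) -> tf.Tensor:
    position_ids = tf.cumsum(attention_mask, axis=-1) - 1
    # Padding positions would be -1; clamp so embedding lookups stay in range.
    return tf.maximum(position_ids, 0)

# e.g. mask [[0, 0, 1, 1, 1]] (two left-padding tokens) -> [[0, 0, 0, 1, 2]]
print(positions_from_attention_mask(tf.constant([[0, 0, 1, 1, 1]])))
```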
Hi @gante, I had a bit of a poke around. I think the complex tests all fail for the same reason: those models have a `max_position_embeddings` setting that defaults to 20 during testing, which is too short for the "slow" tests. Here's a simple fix for those: https://github.com/dsuess/transformers/commit/4a3e27164ae941fcd649b8565d7d92a4552d689f. I'll give the other ones a shot now.
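To illustrate the idea (the config class and the value 100 are stand-ins, mirroring the linked commit's intent rather than its exact diff):

```python
from transformers import OPTConfig

# The slow test's beam search generates more than 20 tokens, so a test config
# with the tiny default max_position_embeddings=20 runs out of position
# embeddings. Raising the limit in the test config is the whole fix; the
# value 100 here is illustrative, not the exact number from the commit.
config = OPTConfig(max_position_embeddings=100)
```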
Hello @gante, may I ask if there is anything I can contribute?
Hi JuheonChu 👋 Actually yes! I have a few unchecked models at the top, but I wouldn't recommend spending time there unless you plan to use those architectures -- they are infrequently used.
However, two popular models are currently failing their XLA tests with beam search:
You can see the failing test if you install from `main` (`pip install --upgrade git+https://github.com/huggingface/transformers.git`) and run it, e.g. for OPT: `NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 py.test -vv tests/models/opt/test_modeling_tf_opt.py::TFOPTModelTest::test_xla_generate_slow`
I haven't dived in yet, so I don't know the cause of the failure. You'll have to hop into debug mode and see what is breaking :)
Can @katiele47 and I try working on them?
@JuheonChu of course!
> @JuheonChu of course!

@gante Are we figuring out the cause of the testing failures based on the following clues?
@JuheonChu yes. My suggestion would be to attempt to find where the numerical differences start (between the XLA and the non-XLA version), using a debugger. Please note that you can't print variables with `jit_compile=True`, so you should set it to `False`. From there, the root cause is typically apparent.
Be warned, these sorts of tasks can be very time-consuming to complete :)
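For a concrete starting point, here is a minimal sketch of that comparison (the checkpoint and generation length are arbitrary picks, not from the thread):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Compare XLA and non-XLA generation on the same inputs; once they disagree,
# drop into a debugger inside the model code -- with jit_compile=False, since
# compiled functions cannot print intermediate tensors.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Debugging XLA generation", return_tensors="tf")

xla_generate = tf.function(model.generate, jit_compile=True)
eager_ids = model.generate(**inputs, max_new_tokens=16)
xla_ids = xla_generate(**inputs, max_new_tokens=16)

# The first mismatching position is where numerical drift changed a token.
print(tf.reduce_all(eager_ids == xla_ids).numpy())
```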
Thank you very much for your valuable guidance! We will try and keep you updated!
Hi @gante, I've attempted to reproduce the failing XLA test on the OPT model using your suggested commands. The error I got was somewhat different from @JuheonChu's. Would you be able to verify whether the following is the expected failing-test output? If not, I assume it could be due to my local repo. Thanks!
@gante working on XLNet
This issue is used to track TensorFlow XLA generation issues, arising from #17857. There are three categories of issues, sorted in descending order by severity:
Key model issues
These are heavily-used models, whose quality should be prioritized.
`max_length`. See here.

Models failing basic tests
These models are failing `test_xla_generate_fast` -- a short greedy generation.

Models failing complex tests
These models are failing `test_xla_generate_slow` -- a long beam search generation.
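As a rough sketch of what the two test flavors exercise (an approximation of the tests' structure, not their exact code), both compare compiled and eager generation:

```python
import tensorflow as tf

def check_xla_generate(model, inputs, **generate_kwargs):
    # Both flavors share this shape: generate with and without XLA and
    # require identical outputs.
    xla_generate = tf.function(model.generate, jit_compile=True)
    eager = model.generate(**inputs, **generate_kwargs)
    compiled = xla_generate(**inputs, **generate_kwargs)
    tf.debugging.assert_equal(eager, compiled)

# "fast"/basic flavor: short greedy generation
# check_xla_generate(model, inputs, do_sample=False, max_new_tokens=8)
# "slow"/complex flavor: long beam search (values here are illustrative)
# check_xla_generate(model, inputs, num_beams=4, max_new_tokens=64)
```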