cc @muellerzr
When running on my machine via main, I do get the torch._inductor warning, meaning that compilation is happening (and verified by looking at accelerate). I'm running on two T4s so I may not see the direct speed impact we might expect, but I got 45s with compilation and 15s without. @sgugger any thoughts on why it might not be faster?
I haven't tried torch.compile on multiple GPUs as it wasn't ready when I was first experimenting.
I gave it another try, and torch_compile=True actually gives a minor additional performance gain (~10%), but there are still no signs of model compilation in the logs.
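For reference, the flag in question is just the TrainingArguments field; a minimal sketch below (the values shown besides torch_compile are illustrative defaults, not taken from this issue):

```python
from transformers import TrainingArguments

# Minimal sketch: torch_compile=True is the only compile-related switch;
# everything else in the training setup stays the same.
args = TrainingArguments(
    output_dir="out",
    torch_compile=True,
    # torch_compile_backend="inductor",  # optional; inductor is the default backend
    # torch_compile_mode="default",      # optional; e.g. "reduce-overhead", "max-autotune"
)
```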
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 29C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 28C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 29C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 29C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
@eugene-kostrov can you report your versions of transformers and accelerate? Again, when I was running this I could see logs when I was building from github/main on both :)
@muellerzr accelerate==0.21.0
I tried both of the following transformers versions; the logs were the same:
transformers==4.31.0
transformers==4.32.0.dev2
Can you try installing via:
pip install git+https://github.com/huggingface/accelerate git+https://github.com/huggingface/transformers
Thanks @eawer!
Interesting, as I definitely see the logs here.
accelerate launch test.py
/home/zach_mueller_huggingface_co/.cache/huggingface/modules/datasets_modules/datasets/banking77/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4/banking77.py:59: FutureWarning: Dataset 'banking77' is deprecated and will be deleted. Use 'PolyAI/banking77' instead.
warnings.warn(
/home/zach_mueller_huggingface_co/.cache/huggingface/modules/datasets_modules/datasets/banking77/9898c11f6afa9521953d2ef205667b527bad14ef9cab445d470f16240c8c8ec4/banking77.py:59: FutureWarning: Dataset 'banking77' is deprecated and will be deleted. Use 'PolyAI/banking77' instead.
warnings.warn(
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
The speedups for torchdynamo mostly come wih GPU Ampere or higher and which is not detected here.
0%| | 0/24 [00:00<?, ?it/s]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[2023-08-16 18:49:33,566] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:33,587] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:34,954] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:34,993] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:36,700] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:36,731] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:38,004] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:38,031] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:39,525] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:39,529] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:40,767] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:40,777] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:42,030] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:42,071] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:43,559] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:43,603] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:44,823] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:44,878] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:46,085] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:46,171] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:47,346] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:47,440] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:48,851] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:48,952] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,139] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,254] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,801] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-08-16 18:49:50,932] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 4.3287, 'learning_rate': 4.791666666666667e-05, 'epoch': 0.12}
{'loss': 4.2453, 'learning_rate': 4.5833333333333334e-05, 'epoch': 0.25}
{'loss': 4.1773, 'learning_rate': 4.375e-05, 'epoch': 0.38}
{'loss': 4.0474, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.5}
{'loss': 3.9611, 'learning_rate': 3.958333333333333e-05, 'epoch': 0.62}
{'loss': 3.9228, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.75}
{'loss': 3.8479, 'learning_rate': 3.541666666666667e-05, 'epoch': 0.88}
{'loss': 3.7447, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}
{'eval_loss': 3.690765380859375, 'eval_f1': 0.26359295290537466, 'eval_runtime': 0.3893, 'eval_samples_per_second': 1315.035, 'eval_steps_per_second': 5.137, 'epoch': 1.0}
{'loss': 3.6982, 'learning_rate': 3.125e-05, 'epoch': 1.12}
{'loss': 3.6525, 'learning_rate': 2.916666666666667e-05, 'epoch': 1.25}
{'loss': 3.5546, 'learning_rate': 2.7083333333333332e-05, 'epoch': 1.38}
{'loss': 3.5015, 'learning_rate': 2.5e-05, 'epoch': 1.5}
{'loss': 3.4782, 'learning_rate': 2.2916666666666667e-05, 'epoch': 1.62}
{'loss': 3.4152, 'learning_rate': 2.0833333333333336e-05, 'epoch': 1.75}
{'loss': 3.3385, 'learning_rate': 1.8750000000000002e-05, 'epoch': 1.88}
{'loss': 3.3378, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}
{'eval_loss': 3.2540321350097656, 'eval_f1': 0.49614175520769455, 'eval_runtime': 0.2964, 'eval_samples_per_second': 1727.241, 'eval_steps_per_second': 6.747, 'epoch': 2.0}
{'loss': 3.2948, 'learning_rate': 1.4583333333333335e-05, 'epoch': 2.12}
{'loss': 3.2471, 'learning_rate': 1.25e-05, 'epoch': 2.25}
{'loss': 3.2197, 'learning_rate': 1.0416666666666668e-05, 'epoch': 2.38}
{'loss': 3.1782, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.5}
{'loss': 3.1959, 'learning_rate': 6.25e-06, 'epoch': 2.62}
{'loss': 3.1684, 'learning_rate': 4.166666666666667e-06, 'epoch': 2.75}
{'loss': 3.1546, 'learning_rate': 2.0833333333333334e-06, 'epoch': 2.88}
{'loss': 3.1194, 'learning_rate': 0.0, 'epoch': 3.0}
{'eval_loss': 3.0891151428222656, 'eval_f1': 0.604639735844818, 'eval_runtime': 0.2952, 'eval_samples_per_second': 1734.197, 'eval_steps_per_second': 6.774, 'epoch': 3.0}
{'train_runtime': 43.3381, 'train_samples_per_second': 141.769, 'train_steps_per_second': 0.554, 'train_loss': 3.576234668493271, 'epoch': 3.0}
100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 24/24 [00:43<00:00, 1.81s/it]
This is using the PyPI versions of accelerate and transformers on torch 2.0.1.
@muellerzr sure
(test-comp) root@pytorch-2-0-0-gpu-p-ml-g5-12xlarge-abc:~# pip freeze | grep "transformers\|torch\|accelerate"
accelerate @ git+https://github.com/huggingface/accelerate@d087be01566477d99b660526adb7da4ec31abf1d
torch==2.0.1
transformers @ git+https://github.com/huggingface/transformers@1982dd3b15867c46e1c20645901b0de469fd935f
Here are the results of this command for a single GPU (compilation works, ~42k lines of output):
CUDA_VISIBLE_DEVICES=0 TRANSFORMERS_VERBOSITY=debug ACCELERATE_VERBOCITY=debug TORCH_COMPILE_DEBUG=1 TORCH_LOGS=dynamo,inductor,guards python test_comppiled.py 2>&1 | tee visible_devices_0.txt
Attachment: visible_devices_0.txt
Here are the results of this command for 4 GPUs (compilation does not happen, ~400 lines of output):
CUDA_VISIBLE_DEVICES=0,1,2,3 TRANSFORMERS_VERBOSITY=debug ACCELERATE_VERBOCITY=debug TORCH_COMPILE_DEBUG=1 TORCH_LOGS=dynamo,inductor,guards python test_comppiled.py 2>&1 | tee visible_devices_0123.txt
Attachment: visible_devices_0123.txt
@eawer the issue here is that the Trainer doesn't support model parallelism with torch compile yet. If you use DDP (for example by launching with accelerate launch instead), it will run and log exactly as we expect. cc @SunMarc
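For example, something like the following launch command (illustrative only; the script name and process count are placeholders, not taken from this thread):

```bash
# Run the same training script under DDP so each process owns exactly one GPU;
# the Trainer's torch_compile path then behaves as it does in the single-GPU case.
accelerate launch --multi_gpu --num_processes 4 test.py
```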
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.31.0
Who can help?
@sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run this code:
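A minimal sketch of this kind of setup is shown below (reconstructed from the logs above: banking77 and bert-base-uncased come from the logs, while the batch size, data handling, and hyperparameters are assumptions rather than the exact original script):

```python
# Hedged sketch only, not the author's original script.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

dataset = load_dataset("banking77")  # 77 intent classes
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=77
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=64,   # assumed for illustration; not stated in the issue
    evaluation_strategy="epoch",
    torch_compile=True,               # the flag under discussion
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```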
Expected behavior
This code runs as expected on a machine with a single GPU. The model is compiled (there is output saying that layers are optimized, etc.), and training speeds up significantly (well, not for this specific example model/data combination, but it does for the production one). Compilation-related output:
But if I run the very same code on a machine with multiple GPUs, there are no signs of model compilation (no additional output in the logs) and the training speed does not improve.
nvidia-smi output: