pytorch/xla
Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

[torchbench] The official benchmark for performance and accuracy check #7040

Open shenh10 opened 2 months ago

shenh10 commented 2 months ago

❓ Questions and Help

Hi, I found two codebases available for testing torchbench with pytorch/xla:

  1. The one provided by pytorch official: https://github.com/pytorch/pytorch/tree/main/benchmarks/dynamo
  2. Another one provided by pytorch/xla team: https://github.com/pytorch/xla/tree/master/benchmarks

However, for the first codebase, it seems that the dynamo + openxla backend support does not actually trigger XLA compilation. Is it no longer maintained?

As for the second one, I found it can measure performance, but there is no way to validate accuracy against eager mode, which the first benchmark tool can do. Is there any support for this?

Looking forward to your feedback.

JackCaoG commented 2 months ago

@zpcore Can you provide more details on how to run torchbench with pytorch/xla?

zpcore commented 2 months ago

Here is the configuration script we use to run torchbench on TPU/GPU: https://github.com/GoogleCloudPlatform/ml-auto-solutions/blob/master/dags/pytorch_xla/configs/pytorchxla_torchbench_config.py.

For example, when targeting TPU, get_torchbench_tpu_config() is the main entry function; it constructs all the commands, including installing dependencies and the torchbench models, running torchbench, and uploading results to a GCS bucket (which you may not need).

The GPU flow is similar, except that all commands run inside our torch_xla GPU release Docker image.

zpcore commented 2 months ago

However, for the first codebase, it seems that the dynamo + openxla backend support does not actually trigger XLA compilation. Is it no longer maintained?

I don't think native torchbench supports the openxla backend. We need to move the model to XLA devices, which is handled in https://github.com/pytorch/xla/tree/master/benchmarks.
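For context, "moving the model to XLA devices" amounts to something like the following minimal sketch (the toy model and input here are placeholders, not taken from either benchmark suite):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                           # XLA device (TPU, or XLA:GPU)
model = torch.nn.Linear(16, 4).to(device)          # placeholder model on the XLA device
example_input = torch.randn(8, 16, device=device)  # placeholder input on the XLA device

# dynamo with the openxla backend only compiles through XLA when the tensors
# involved actually live on the XLA device, which is why the move above matters.
compiled = torch.compile(model, backend="openxla")
out = compiled(example_input)
```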

As for the second one, I found it can measure performance, but there is no way to validate accuracy against eager mode, which the first benchmark tool can do. Is there any support for this?

We don't have plans to add an accuracy metric at this time.

shenh10 commented 2 months ago

Here is the configuration script we use to run torchbench on TPU/GPU: https://github.com/GoogleCloudPlatform/ml-auto-solutions/blob/master/dags/pytorch_xla/configs/pytorchxla_torchbench_config.py.

For example, when targeting TPU, get_torchbench_tpu_config() is the main entry function; it constructs all the commands, including installing dependencies and the torchbench models, running torchbench, and uploading results to a GCS bucket (which you may not need).

The GPU flow is similar, except that all commands run inside our torch_xla GPU release Docker image.

Thank you for your reply. I did not use Google Cloud. It seems that this mainly relies on the running method of xla/benchmark/experimental_runner.py. I know how to use it, but unfortunately it does not seem to support an accuracy check. I would like to confirm with you that the benchmark under pytorch/benchmarks/dynamo/ (mainly torchbench.py and common.py) is not maintained by your group, as the openxla backend appears to be only partially, but not correctly, supported there.

shenh10 commented 2 months ago

Here is the configuration script we use to run torchbench on TPU/GPU: https://github.com/GoogleCloudPlatform/ml-auto-solutions/blob/master/dags/pytorch_xla/configs/pytorchxla_torchbench_config.py. For example, when targeting TPU, get_torchbench_tpu_config() is the main entry function; it constructs all the commands, including installing dependencies and the torchbench models, running torchbench, and uploading results to a GCS bucket (which you may not need). The GPU flow is similar, except that all commands run inside our torch_xla GPU release Docker image.

Thank you for your reply. I did not use Google Cloud. It seems that this mainly relies on the running method of xla/benchmark/experimental_runner.py. I know how to use it, but it does not seem to support an accuracy check.

Okay, I think an accuracy check is probably quite important. FYI, I modified torchbench under pytorch/benchmarks/dynamo to use its accuracy check method for correctness verification (I moved both the model and example_inputs to the XLA device; a rough sketch of that change is at the end of this comment). Below is the correctness verification I did for the remaining models, after excluding examples that still could not run with pytorch/xla within torchbench. This may be a helpful reference for investigating correctness issues with pytorch/xla:

Environment: NVIDIA A100 80G with CUDA 12.1
Configuration: default example batch size
PyTorch version: torch 2.3.0-rc12 (compiled from source), pytorch/xla 2.3.0-rc12 (compiled from source)

Experimental control groups:

  1. dynamo-openxla vs eager
  2. dynamo-inductor vs eager

(Tolerance refers to the tol in https://github.com/pytorch/pytorch/blob/037615b989b37b1bf5eff0c031055fc8d1fbe5ae/torch/_dynamo/utils.py#L1303.)

Testing command example:

./benchmarks/dynamo/torchbench.py --device=cuda --iterations-per-run=1 --output=torchbench_training_fp32_xla.csv --output-directory=./reports_only --trace-on-xla --backend=openxla --accuracy --train --iterations=10 --xla-tolerance 0.1 --only=dcgan --float32
[image: per-model accuracy results table for the three control groups]

The table lists the three control groups, where: 1) red indicates accuracy check failure; 2) green indicates accuracy check pass; 3) yellow indicates that the two eager runs themselves do not agree, so the entry can be ignored.

Notation:

Exp1: dynamo-openxla, tolerance=0.01
Exp2: dynamo-openxla, tolerance=0.1
Exp3: dynamo-inductor, tolerance=0.01

Comparing Exp1 and Exp3, under the same tolerance, dynamo-inductor stays very close to eager in terms of accuracy, while dynamo-openxla shows a significant accuracy difference.

Comparing Exp1 and Exp2, relaxing the accuracy tolerance threshold to a relatively high level lets some cases pass. However, the cases that still fail under Exp2 are likely due to compilation bugs, such as the issue mentioned in https://github.com/pytorch/xla/issues/7042, which affects most hf_xxx models and leads to incorrect calculations.
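For reference, the modification described above is conceptually something like the sketch below. This is only a rough illustration, not the actual torchbench diff: the helper name check_accuracy_on_xla is hypothetical, it assumes the model returns a single tensor, and torchbench's real accuracy check (torch._dynamo.utils.same) additionally handles nested outputs and fp64 reference runs.

```python
import torch
import torch_xla.core.xla_model as xm

def check_accuracy_on_xla(model, example_inputs, tol=0.01):
    """Hypothetical helper: compare dynamo+openxla output against the eager baseline."""
    eager_out = model(*example_inputs)          # eager baseline on the original device

    device = xm.xla_device()
    model = model.to(device)                    # move the model to the XLA device
    xla_inputs = tuple(
        t.to(device) if isinstance(t, torch.Tensor) else t for t in example_inputs
    )                                           # ...and its example inputs

    compiled = torch.compile(model, backend="openxla")  # dynamo + openxla backend
    xla_out = compiled(*xla_inputs).cpu()

    return torch.allclose(eager_out.cpu(), xla_out, rtol=tol, atol=tol)
```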

zpcore commented 2 months ago

Hi @shenh10, thanks for checking the accuracy.

@JackCaoG , do you know what can cause the accuracy difference with the openxla backend? They are using the native torchbench script for the experiment.

JackCaoG commented 2 months ago

There are a couple of possibilities:

  1. Something is wrong with PyTorch/XLA's dynamo implementation.
  2. XLA:GPU performs some optimization that has implications for accuracy.

I think the easiest way to check is to use LazyTensor to run the model (pretty much just drop torch.compile and add a mark_step after loss.backward) and compare its gradients with those from the native GPU run. If the issue is in our dynamo integration, we can try to figure out why. If the issue is in XLA:GPU, we will likely need to figure out which optimization pass is causing the accuracy difference.
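A minimal sketch of that LazyTensor check, assuming a generic classification model and loss (the function names, loss, and tolerances here are illustrative, not part of any existing tooling):

```python
import torch
import torch_xla.core.xla_model as xm

def lazytensor_grads(model, inputs, targets):
    """Run one training step on the XLA device via LazyTensor (no torch.compile)
    and return the gradients on CPU."""
    device = xm.xla_device()
    model = model.to(device)
    inputs, targets = inputs.to(device), targets.to(device)

    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    xm.mark_step()  # materialize the lazily traced graph after loss.backward()

    return {n: p.grad.detach().cpu() for n, p in model.named_parameters()
            if p.grad is not None}

def compare_grads(xla_grads, gpu_grads, rtol=1e-2, atol=1e-2):
    """Report parameters whose XLA gradients drift from the native GPU gradients."""
    for name, g in xla_grads.items():
        if not torch.allclose(g, gpu_grads[name], rtol=rtol, atol=atol):
            print(f"gradient mismatch: {name}")
```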

BTW, we don't really recommend users do torch.compile for training right now (we only use dynamo for inference); all of our training runs use LazyTensor.